<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="left" src="images/etcbc4easy-small.png"/></a>
<a href="http://tla.mpi.nl" target="_blank"><img align="right" src="images/TLA-xsmall.png"/></a>
<a href="http://www.dans.knaw.nl" target="_blank"><img align="right" src="images/DANS-xsmall.png"/></a>

# Phonological representation

This notebook renders the plain text of the Hebrew Bible in an approximate phonological representation.
The representation is meant to be convenient for trigram analysis of the Hebrew text.

We produce two transcriptions: a simple one (``simple.txt``), which blurs some of the finer Masoretic distinctions, and a precise one (``phono.txt``), which maps one-to-one onto the Masoretic text.

You can download these transcriptions directly from my
[SURFdrive](https://surfdrive.surf.nl/files/public.php?service=files&t=355dba3fbef111fc3ab8ac6554aaf85a).
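
As a rough illustration of the trigram analysis mentioned above, the sketch below counts character trigrams over a couple of transcribed words. It assumes that trigrams are taken within a single word and never cross word boundaries, which may not be the intended setup. The names `char_trigrams` and `trigram_counts` are hypothetical helpers, and the two sample words are copied from the simple transcriptions of the test words further down in this notebook.

```python
from collections import Counter


def char_trigrams(word):
    """Yield every run of three consecutive characters in a word."""
    for i in range(len(word) - 2):
        yield word[i:i + 3]


def trigram_counts(words):
    """Count character trigrams over an iterable of words."""
    counts = Counter()
    for word in words:
        counts.update(char_trigrams(word))
    return counts


# Two words in the simple transcription (assumed sample, not the full text)
sample_words = ["wajja`aṡ", "uləkal-xajjat"]
counts = trigram_counts(sample_words)

# Show the trigrams that occur more than once in the sample
for trigram, n in sorted(counts.items(), key=lambda x: (-x[1], x[0])):
    if n > 1:
        print("{:>3} x {}".format(n, trigram))
```

To run this over a whole file, one would feed it the whitespace-separated words of ``simple.txt`` after stripping the passage labels.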

In [1]:
import sys
import collections

from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.lib import Transcription
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.5.3
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: http://shebanq-doc.readthedocs.org/en/latest/texts/welcome.html



In [3]:
version = '4b'
fabric.load('etcbc{}'.format(version), '--', 'phono', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype
        g_word_utf8 g_cons_utf8 trailer_utf8
        g_word g_cons lex_utf8
        book chapter verse label
    ''',''),
    "primary": True,
    "prepare": prepare,
})
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2015-06-29T05-30-49
  2.67s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
  0.00s LOADING API with EXTRAs: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2015-06-29T05-30-49
  0.67s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX -- FOR TASK phono AT 2015-06-29T06-08-30
  0.00s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX -- FOR TASK phono AT 2015-06-29T06-08-30


## Trailer

Before we generate the text, let's list all the different trailers (the material that follows a word, such as a space, a maqaf or a sof pasuq) and their number of occurrences.

In [4]:
trailer = collections.defaultdict(int)

for node in NN(test=F.otype.v, value='word'):
    trailer[F.trailer_utf8.v(node)] += 1

trailer_file = outfile('trailers.txt')
for (trl, n) in sorted(trailer.items(), key=lambda x: (-x[1], x[0])):
    trailer_file.write("{:>7} x [{}]\n".format(n, trl))
trailer_file.close()

In [5]:
cat {my_file('trailers.txt')}

 237039 x [ ]
 121795 x []
  42275 x [־]
  20037 x [׃
]
   2266 x [ ׀ ]
   1892 x [׃ ס
]
   1165 x [׃ פ
]
     76 x [ ס]
     13 x [ פ]
      7 x [׃ ׆̇
]
      1 x [׃ ׆̇ ס
]
      1 x [׃ ׆̇ פ
]


## Vocalized text versus consonantal text, Hebrew Unicode versus Transliteration

Now for the complete text. Note that we insert some newlines.

If you want the consonantal text, replace the feature ``g_word_utf8`` by ``g_cons_utf8``.

In many cases the use of Hebrew Unicode characters, however pleasing to the eye, is not preferred.
Often the Hebrew occurs embedded in non-Hebrew text, or under tree structures, where the Hebrew right-to-left writing
direction does not play nicely with the context.
Moreover, rendering software such as text editors, command prompts and browsers solves the puzzle of multiple writing directions
in unpredictable ways.

In those cases you can resort to a *transliteration*, with or without vowels.
Use the features ``g_word`` and ``g_cons``.

## Rules

- B = b or b
- G = g or g
- D = d or dh
- K = k or ch
- P = p or f
- T = t or t

In [6]:
import re
trans = Transcription()

In [7]:
specials = (
    ('>', 'alef', "'"),
    ('<', 'ayin', "`"),
    ('v', 'tet', chr(0x1E6B)),
    ('y', 'tsade', chr(0x0167)),
    ('c', 'shin', chr(0x0161)),
    ('f', 'sin', chr(0x1E61)),
    ('#', 's(h)in', chr(0x1E67)),
    (':', 'schwa', chr(0x0259)),
    ('@', 'qamats', chr(0x0101)),
    ('e', 'segol', chr(0x00E8)),
    (';', 'tsere', 'e'),
    (':E', 'hataf segol', chr(0x0115)),
    (':A', 'hataf patach', chr(0x00E0)),
    (':@', 'hataf qamats', chr(0x0103)),
    ('ij', 'long hireq', chr(0x012B)),
    (';j', 'long tsere', chr(0x0113)),
    ('ow', 'long holam', chr(0x014D)),
    ('w.', 'long `qibbuts`', chr(0x016B)),
    ('b.', 'b dagesh-lene', chr(0x0253)),
    ('g.', 'g dagesh-lene', chr(0x0260)),
    ('d.', 'd dagesh-lene', chr(0x0257)),
    ('k.', 'k dagesh-lene', chr(0x0199)),
    ('p.', 'p dagesh-lene', chr(0x01A5)),
    ('t.', 't dagesh-lene', chr(0x01AD)),
)
# Print each transliteration symbol, its name, and the glyph used in the phonological output
for (sym, let, glyph) in specials:
    print('{:<3} {:<10} {:>3}'.format(sym, let, glyph))

>   alef         '
<   ayin         `
v   tet          ṫ
y   tsade        ŧ
c   shin         š
f   sin          ṡ
#   s(h)in       ṧ
:   schwa        ə
@   qamats       ā
e   segol        è
;   tsere        e
:E  hataf segol   ĕ
:A  hataf patach   à
:@  hataf qamats   ă
ij  long hireq   ī
;j  long tsere   ē
ow  long holam   ō
w.  long `qibbuts`   ū
b.  b dagesh-lene   ɓ
g.  g dagesh-lene   ɠ
d.  d dagesh-lene   ɗ
k.  k dagesh-lene   ƙ
p.  p dagesh-lene   ƥ
t.  t dagesh-lene   ƭ


In [81]:
# Raw strings keep regex escapes such as \. intact and avoid invalid-escape warnings
dages_forte_lene = re.compile(r'([@;aeiou])([bgdkpt])\.([:%@;aeiou])')
dages_forte = re.compile(r'([@;aeiou])(.)\.([:%@;aeiou])')
dages_lene = re.compile(r'([bdgkpt])\.')
silent_aleph = re.compile(r"'(?:([^:@;aeiou])|$)")
furtive_patah = re.compile(r"([io;]|(?:w\.)|(?:ij))([x<]|(?:h.))a$")
nm = re.compile(r'[0-9]+')
mobile_schwa = re.compile(r'''
    (
        (?:^.\.?)|
        (?:[ -].\.?)|
        (?:.\.)|
        (?::.\.?)|
        (?:
            (?:
                @'?|
                ;j?|
                ij|
                ow|
                w\.
            )
            [^:@;aeiou]
        )
    )
    :
    (?![@ae])
''', re.X)
mobile_schwa2 = re.compile('([^:;@aeio]):m')

dagesh_lene_dict = dict(
    b='ɓ',
    g='ɠ',
    d='ɗ',
    k='ƙ',
    p='ƥ',
    t='ƭ',
)

def dages_forte_lene_repl(match):
    return match.group(1) + (dagesh_lene_dict[match.group(2)] * 2) + match.group(3)

def dages_forte_lene_repl_simple(match):
    return match.group(1) + (match.group(2) * 2) + match.group(3)

def dages_lene_repl(match):
    return dagesh_lene_dict[match.group(1)]

def dages_lene_repl_simple(match):
    return match.group(1)
                              
def dages_forte_repl(match):
    return match.group(1) + (match.group(2) * 2) + match.group(3)

def silent_alpha_repl(match):
    return match.group(1)

def furtive_patah_repl(match):
    return match.group(1)+'a'+match.group(2)

def furtive_patah_repl_simple(match):
    return match.group(1)+match.group(2)

def mobile_schwa_repl(match):
    return match.group(1)+'%'

def mobile_schwa_repl2(match):
    return match.group(1)+'%'+match.group(1)

def phono(w):
    result = nm.sub('', w.lower()).replace('_', ' ')
    result = mobile_schwa.sub(mobile_schwa_repl, result)
    result = mobile_schwa2.sub(mobile_schwa_repl2, result)
    result = dages_forte_lene.sub(dages_forte_lene_repl, result)
    result = dages_forte.sub(dages_forte_repl, result)
    result = dages_lene.sub(dages_lene_repl, result)
    result = silent_aleph.sub(silent_alpha_repl, result)
    result = furtive_patah.sub(furtive_patah_repl, result)
    if result.endswith('k:'): result = result[0:-1]

    result = result.\
        replace('>', "'").\
        replace('<', "`").\
        replace('v', 'ṫ').\
        replace('y', 'ŧ').\
        replace('c', 'š').\
        replace('f', 'ṡ').\
        replace('#', 'ṧ')
    result = result.\
        replace('ij', 'ī').\
        replace(';j', 'ē').\
        replace('ow', 'ō').\
        replace('w.', 'ū')
    result = result.\
        replace(':a', 'à').\
        replace(':@', 'ă').\
        replace(':e', 'ĕ').\
        replace('%', 'ə').\
        replace(':', '').\
        replace('@', 'ā').\
        replace('e', 'è').\
        replace(';', 'e')
    return result

def simple(w):
    result = nm.sub('', w.lower()).replace('_', ' ')
    result = mobile_schwa.sub(mobile_schwa_repl, result)
    result = mobile_schwa2.sub(mobile_schwa_repl2, result)
    result = dages_forte_lene.sub(dages_forte_lene_repl_simple, result)
    result = dages_forte.sub(dages_forte_repl, result)
    result = dages_lene.sub(dages_lene_repl_simple, result)
    result = silent_aleph.sub(silent_alpha_repl, result)
    result = furtive_patah.sub(furtive_patah_repl_simple, result)
    if result.endswith('k:'): result = result[0:-1]

    result = result.\
        replace('>', "'").\
        replace('<', "`").\
        replace('v', 'ṫ').\
        replace('y', 'ŧ').\
        replace('c', 'š').\
        replace('f', 'ṡ').\
        replace('#', 'ṧ')
    result = result.\
        replace('ij', 'i').\
        replace(';j', 'e').\
        replace('ow', 'o').\
        replace('w.', 'u')
    result = result.\
        replace(':a', 'a').\
        replace(':@', 'a').\
        replace(':e', 'e').\
        replace('%', 'ə').\
        replace(':', '').\
        replace('@', 'a').\
        replace(';', 'e')
    return result

In [82]:
tests = (
    'HAC.@MA73JIm',
    'RO75M:M@92T:HW.',
    'HAM.:>ORO73T',
    'R@QI73J<A',
    'W:R74W.XA',
    'WAJ.A74<AF',
    '>A75XARE80Jk@',
    '<AL-P.:N;74J',
    'HAXO75CEk:',
    'W.75L:K@L-XAJ.A74T',
    'L:HIT:MAH:M;80H.A',
)
for test in tests:
    print('{}\n{}\n'.format(phono(test), simple(test)))


haššāmajim
haššamajim

roməmātəhū
roməmatəhu

hammə'orot
hammə'orot

rāqīa`
raqi`

wərūax
wərux

wajja`aṡ
wajja`aṡ

'axarèjkā
'axarejka

`al-ƥənē
`al-pəne

haxošèk
haxošek

ūləkāl-xajjat
uləkal-xajjat

ləhitətahəheahh
ləhitətahəhehh
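
To see how a single word moves through the rules, the sketch below follows the first test word through three of the stages only: stripping the accent numbers, doubling a consonant with dagesh forte, and substituting glyphs. It is a reduced illustration and not the full `phono` function; the names `digits`, `dagesh_forte` and `double_consonant` are chosen here for readability, and only the two glyph substitutions this word needs are applied.

```python
import re

# Accent numbers embedded in the transliteration
digits = re.compile(r'[0-9]+')
# A consonant followed by '.' between two vowel signs: dagesh forte
dagesh_forte = re.compile(r'([@;aeiou])(.)\.([:%@;aeiou])')


def double_consonant(match):
    """Write the consonant twice and drop the dagesh dot."""
    return match.group(1) + match.group(2) * 2 + match.group(3)


word = 'HAC.@MA73JIm'
stage1 = digits.sub('', word.lower())                 # lowercase, no accents
stage2 = dagesh_forte.sub(double_consonant, stage1)   # double the shin
stage3 = stage2.replace('c', 'š').replace('@', 'ā')   # glyph substitution

for stage in (stage1, stage2, stage3):
    print(stage)
```

For this particular word the last line agrees with the precise transcription shown in the test output above, because none of the schwa, dagesh lene or furtive patah rules apply to it.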



In [84]:
orig_file = outfile("orig.txt")
phono_file = outfile("phono.txt")
simple_file = outfile('simple.txt')
for v in F.otype.s('verse'):
    passage_label = '{} {}:{}'.format(
        F.book.v(L.u('book', v)), 
        F.chapter.v(L.u('chapter', v)),
        F.verse.v(v),
    )
    phono_file.write('{}  '.format(passage_label))
    orig_file.write('{}  '.format(passage_label))
    simple_file.write('{}  '.format(passage_label))

    clause_atoms = L.d('clause_atom', v)
    verse_text = ''
    for c in clause_atoms:
        the_sep = ''
        words = L.d('word', c)
        for w in words:
            the_text = F.g_word_utf8.v(w)
            the_trailer = F.trailer_utf8.v(w)
            the_sep = '-' if '־' in the_trailer else '\t' if '׃' in the_trailer else ' ' if ' ' in the_trailer else ''
            the_newline = '\n' if '\n' in the_trailer else ''
            verse_text += trans.from_hebrew(the_text) + the_sep + the_newline
        if the_sep == ' ':
            verse_text = verse_text.rstrip(' ')
        if the_sep not in {'\n', '\t'}:
            verse_text += ','        
        
    paras = verse_text.split('\n')
    for para in paras:
        if para == '': continue
        sentences = para.split('\t')
        for sentence in sentences:
            if sentence == '': continue
            ca_phonos = []
            ca_origs = []
            ca_simples = []
            clause_atoms = sentence.split(',')
            for clause_atom in clause_atoms:
                if clause_atom in {'', ' '}: continue
                chunks = clause_atom.split(' ')
                phonos = []
                origs = []
                simples = []
                for chunk in chunks:
                    phonos.append(phono(chunk))
                    origs.append(chunk)
                    simples.append(simple(chunk))
                ca_phonos.append('{}'.format(' '.join(phonos)))
                ca_origs.append('{}'.format(' '.join(origs)))
                ca_simples.append('{}'.format(' '.join(simples)))
            phono_file.write('{}.'.format(', '.join(ca_phonos)))
            orig_file.write('{}.'.format(', '.join(ca_origs)))
            simple_file.write('{}.'.format(', '.join(ca_simples)))
        phono_file.write('\n')
        orig_file.write('\n')
        simple_file.write('\n')
phono_file.close()
orig_file.close()
simple_file.close()
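
The cell above packs each verse into one string, with a newline after a paragraph-final word, a tab after a sof pasuq and a comma after each clause atom, and then splits that string apart again level by level. The nesting can be sketched in isolation as follows. `split_verse` is a hypothetical helper, and the tokens `W1` to `W8` are placeholders rather than real transliterated words.

```python
def split_verse(encoded):
    """Split an encoded verse into paragraphs > sentences > clause atoms > words."""
    paragraphs = []
    for para in encoded.split('\n'):
        if para == '':
            continue
        sentences = []
        for sentence in para.split('\t'):
            if sentence == '':
                continue
            # Skip the empty pieces left by trailing commas
            atoms = [
                atom.split(' ')
                for atom in sentence.split(',')
                if atom not in {'', ' '}
            ]
            sentences.append(atoms)
        paragraphs.append(sentences)
    return paragraphs


# Two paragraphs; the first has two sentences, the first of which has two clause atoms
sample = 'W1 W2,W3 W4\tW5 W6\t\nW7 W8,'
for para in split_verse(sample):
    # One output line per paragraph, sentences closed by '.', clause atoms joined by ', '
    print(''.join(
        '{}.'.format(', '.join(' '.join(atom) for atom in sentence))
        for sentence in para
    ))
```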