{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Phonological representation\n", "\n", "Here comes the plain text of the Hebrew Bible in a kind of phonological representation.\n", "We produce a handy representation to do trigram analysis on the Hebrew text.\n", "\n", "We produce two transcriptions, a simple one (``simple.txt``), blurring some of the finer masoretic distinctions, and a precise one (``phono.txt``), mapping 1-1 on the masorectic text.\n", "\n", "You can download these descriptions directly from my \n", "[SURFdrive](https://surfdrive.surf.nl/files/public.php?service=files&t=355dba3fbef111fc3ab8ac6554aaf85a)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s This is LAF-Fabric 4.5.3\n", "API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n", "Feature doc: http://shebanq-doc.readthedocs.org/en/latest/texts/welcome.html\n", "\n" ] } ], "source": [ "import sys\n", "import collections\n", "\n", "from laf.fabric import LafFabric\n", "from etcbc.preprocess import prepare\n", "from etcbc.lib import Transcription\n", "fabric = LafFabric()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s LOADING API: please wait ... \n", " 0.00s INFO: USING DATA COMPILED AT: 2015-06-29T05-30-49\n", " 2.67s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html\n", " 0.00s LOADING API with EXTRAs: please wait ... \n", " 0.00s INFO: USING DATA COMPILED AT: 2015-06-29T05-30-49\n", " 0.67s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX -- FOR TASK phono AT 2015-06-29T06-08-30\n", " 0.00s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX -- FOR TASK phono AT 2015-06-29T06-08-30\n" ] } ], "source": [ "version = '4b'\n", "fabric.load('etcbc{}'.format(version), '--', 'phono', {\n", " \"xmlids\": {\"node\": False, \"edge\": False},\n", " \"features\": ('''\n", " otype\n", " g_word_utf8 g_cons_utf8 trailer_utf8\n", " g_word g_cons lex_utf8\n", " book chapter verse label\n", " ''',''),\n", " \"primary\": True,\n", " \"prepare\": prepare,\n", "})\n", "exec(fabric.localnames.format(var='fabric'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Trailer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we generate the text, let's list all the different suffixes and their number of occurrences." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "trailer = collections.defaultdict(int)\n", "\n", "for node in NN(test=F.otype.v, value='word'):\n", " trailer[F.trailer_utf8.v(node)] += 1\n", "\n", "trailer_file = outfile('trailers.txt')\n", "for (trl, n) in sorted(trailer.items(), key=lambda x: (-x[1], x[0])):\n", " trailer_file.write(\"{:>7} x [{}]\\n\".format(n, trl))\n", "trailer_file.close()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 237039 x [ ]\r\n", " 121795 x []\r\n", " 42275 x [־]\r\n", " 20037 x [׃\r\n", "]\r\n", " 2266 x [ ׀ ]\r\n", " 1892 x [׃ ס\r\n", "]\r\n", " 1165 x [׃ פ\r\n", "]\r\n", " 76 x [ ס]\r\n", " 13 x [ פ]\r\n", " 7 x [׃ ׆̇\r\n", "]\r\n", " 1 x [׃ ׆̇ ס\r\n", "]\r\n", " 1 x [׃ ׆̇ פ\r\n", "]\r\n" ] } ], "source": [ "cat {my_file('trailers.txt')}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vocalized text versus consonantal text, Hebrew Unicode versus Transliteration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the complete text, note that we insert some newlines.\n", "\n", "If you want the consonantal text, replace the feature ``g_word_utf8`` by ``g_cons_utf8``.\n", "\n", "In many cases the use of Hebrew Unicode characters, however pleasing to the eye, is not preferred.\n", "Often the Hebrew occurrs embedded in non-Hebrew text, or under tree structures where the Hebrew right-to-left writing\n", "direction does not play nice with the context.\n", "Moreover, rendering software such as text editor, command prompts and browsers solve the puzzle of multiple writing directions\n", "in unpredictable ways.\n", "\n", "In those cases you can resort to a *transliteration*, with or without vowels.\n", "Use the features ``g_word`` and ``g_cons``." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Rules" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "B = b or b\n", "G = g or g\n", "D = d or dh\n", "K = k or ch\n", "P = p or f\n", "T = t or t" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re\n", "trans = Transcription()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "> alef '\n", "< ayin `\n", "v tob ṫ\n", "y tsade ŧ\n", "c shin š\n", "f sin ṡ\n", "# s(h)in ṧ\n", ": schwa ə\n", "@ qamats ā\n", "e segol è\n", "; tsere e\n", ":E hataf segol ĕ\n", ":A hataf patach à\n", ":@ hataf qamats ă\n", "ij long hireq ī\n", ";j long tsere ē\n", "ow long holam ō\n", "w. long `qibbuts` ū\n", "b. k dagesh-lene ɓ\n", "g. k dagesh-lene ɠ\n", "d. k dagesh-lene ɗ\n", "k. k dagesh-lene ƙ\n", "p. k dagesh-lene ƥ\n", "t. 
t dagesh-lene ƭ\n" ] } ], "source": [ "# Legend: transliteration symbol, its name, and the glyph used in the transcription.\n", "specials = (\n", " ('>', 'alef', \"'\"),\n", " ('<', 'ayin', \"`\"),\n", " ('v', 'tet', chr(0x1E6B)),\n", " ('y', 'tsade', chr(0x0167)),\n", " ('c', 'shin', chr(0x0161)),\n", " ('f', 'sin', chr(0x1E61)),\n", " ('#', 's(h)in', chr(0x1E67)),\n", " (':', 'schwa', chr(0x0259)),\n", " ('@', 'qamats', chr(0x0101)),\n", " ('e', 'segol', chr(0x00E8)),\n", " (';', 'tsere', 'e',),\n", " (':E', 'hataf segol', chr(0x0115)),\n", " (':A', 'hataf patach', chr(0x00E0)),\n", " (':@', 'hataf qamats', chr(0x0103)),\n", " ('ij', 'long hireq', chr(0x012B)),\n", " (';j', 'long tsere', chr(0x0113)),\n", " ('ow', 'long holam', chr(0x014D)),\n", " ('w.', 'long qibbuts', chr(0x016B)),\n", " ('b.', 'b dagesh-lene', chr(0x0253)),\n", " ('g.', 'g dagesh-lene', chr(0x0260)),\n", " ('d.', 'd dagesh-lene', chr(0x0257)),\n", " ('k.', 'k dagesh-lene', chr(0x0199)),\n", " ('p.', 'p dagesh-lene', chr(0x01A5)),\n", " ('t.', 't dagesh-lene', chr(0x01AD)),\n", ")\n", "for (sym, let, glyph) in specials:\n", " print('{:<3} {:<10} {:>3}'.format(sym, let, glyph))" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Patterns that detect masoretic phenomena in the lowercased ETCBC transliteration.\n", "# A begadkefat letter with dagesh preceded by a vowel carries a dagesh forte: it gets doubled.\n", "dages_forte_lene = re.compile('([@;aeiou])([bgdkpt])\\.([:%@;aeiou])')\n", "# The same holds for any other consonant with dagesh preceded by a vowel.\n", "dages_forte = re.compile('([@;aeiou])(.)\\.([:%@;aeiou])')\n", "# A begadkefat letter whose dagesh remains after the forte passes carries a dagesh lene.\n", "dages_lene = re.compile('([bdgkpt])\\.')\n", "# A silent aleph (one not followed by a vowel sign) is dropped.\n", "silent_aleph = re.compile(\"'(?:([^:@;aeiou])|$)\")\n", "# A patah under a final guttural (het, ayin, he with mapiq) after a long vowel is a furtive patah.\n", "furtive_patah = re.compile(\"([io;]|(?:w\\.)|(?:ij))([x<]|(?:h\\.))a$\")\n", "# Accent numbers, to be stripped.\n", "nm = re.compile('[0-9]+')\n", "# A schwa is mobile under the first consonant of a word, under a consonant with dagesh,\n", "# under a consonant preceded by another schwa, or after a long vowel plus consonant.\n", "# It is marked with % here; a ':' that starts a hataf vowel is left alone.\n", "mobile_schwa = re.compile('''\n", " (\n", " (?:^.\\.?)|\n", " (?:[ -].\\.?)|\n", " (?:.\\.)|\n", " (?::.\\.?)|\n", " (?:\n", " (?:\n", " @'?|\n", " ;j?|\n", " ij|\n", " ow|\n", " w\\.\n", " )\n", " [^:@;aeiou]\n", " )\n", " )\n", " :\n", " (?![@ae])\n", "''', re.X)\n", "mobile_schwa2 = re.compile('([^:;@aeio]):m')\n", "\n", "# Glyphs for the begadkefat letters when they carry a dagesh (plosive pronunciation).\n", "dagesh_lene_dict = dict(\n", " b='ɓ',\n", " g='ɠ',\n", " d='ɗ',\n", " k='ƙ',\n", " p='ƥ',\n", " t='ƭ',\n", ")\n", "\n", "# Replacement callbacks for the patterns above.\n", "def dages_forte_lene_repl(match):\n", " return match.group(1) + (dagesh_lene_dict[match.group(2)] * 2) + match.group(3)\n", "\n", "def dages_forte_lene_repl_simple(match):\n", " return match.group(1) + (match.group(2) * 2) + match.group(3)\n", "\n", "def dages_lene_repl(match):\n", " return dagesh_lene_dict[match.group(1)]\n", "\n", "def dages_lene_repl_simple(match):\n", " return match.group(1)\n", " \n", "def dages_forte_repl(match):\n", " return match.group(1) + (match.group(2) * 2) + match.group(3)\n", "\n", "def silent_alpha_repl(match):\n", " return match.group(1)\n", "\n", "def furtive_patah_repl(match):\n", " return match.group(1)+'a'+match.group(2)\n", "\n", "def furtive_patah_repl_simple(match):\n", " return match.group(1)+match.group(2)\n", "\n", "def mobile_schwa_repl(match):\n", " return match.group(1)+'%'\n", "\n", "def mobile_schwa_repl2(match):\n", " return match.group(1)+'%'+match.group(1)\n", "\n", "# phono: the precise transcription; simple (below): the blurred transcription.\n", "def phono(w):\n", " result = nm.sub('', w.lower()).replace('_', ' ')\n", " result = mobile_schwa.sub(mobile_schwa_repl, result)\n", " result = mobile_schwa2.sub(mobile_schwa_repl2, result)\n", " result = dages_forte_lene.sub(dages_forte_lene_repl, result)\n", " result = dages_forte.sub(dages_forte_repl, result)\n", " result = dages_lene.sub(dages_lene_repl, result)\n", " result = silent_aleph.sub(silent_alpha_repl, result)\n", " result = furtive_patah.sub(furtive_patah_repl, result)\n", " if result.endswith('k:'): result = result[0:-1]\n", "\n", " result = result.\\\n", " replace('>', \"'\").\\\n", " replace('<', 
\"`\").\\\n", " replace('v', 'ṫ').\\\n", " replace('y', 'ŧ').\\\n", " replace('c', 'š').\\\n", " replace('f', 'ṡ').\\\n", " replace('#', 'ṧ')\n", " result = result.\\\n", " replace('ij', 'ī').\\\n", " replace(';j', 'ē').\\\n", " replace('ow', 'ō').\\\n", " replace('w.', 'ū')\n", " result = result.\\\n", " replace(':a', 'à').\\\n", " replace(':@', 'ă').\\\n", " replace(':e', 'ĕ').\\\n", " replace('%', 'ə').\\\n", " replace(':', '').\\\n", " replace('@', 'ā').\\\n", " replace('e', 'è').\\\n", " replace(';', 'e')\n", " return result\n", "\n", "def simple(w):\n", " result = nm.sub('', w.lower()).replace('_', ' ')\n", " result = mobile_schwa.sub(mobile_schwa_repl, result)\n", " result = mobile_schwa2.sub(mobile_schwa_repl2, result)\n", " result = dages_forte_lene.sub(dages_forte_lene_repl_simple, result)\n", " result = dages_forte.sub(dages_forte_repl, result)\n", " result = dages_lene.sub(dages_lene_repl_simple, result)\n", " result = silent_aleph.sub(silent_alpha_repl, result)\n", " result = furtive_patah.sub(furtive_patah_repl_simple, result)\n", " if result.endswith('k:'): result = result[0:-1]\n", "\n", " result = result.\\\n", " replace('>', \"'\").\\\n", " replace('<', \"`\").\\\n", " replace('v', 'ṫ').\\\n", " replace('y', 'ŧ').\\\n", " replace('c', 'š').\\\n", " replace('f', 'ṡ').\\\n", " replace('#', 'ṧ')\n", " result = result.\\\n", " replace('ij', 'i').\\\n", " replace(';j', 'e').\\\n", " replace('ow', 'o').\\\n", " replace('w.', 'u')\n", " result = result.\\\n", " replace(':a', 'a').\\\n", " replace('a', 'a').\\\n", " replace(':@', 'a').\\\n", " replace(':e', 'e').\\\n", " replace('%', 'ə').\\\n", " replace(':', '').\\\n", " replace('@', 'a').\\\n", " replace('e', 'e').\\\n", " replace(';', 'e')\n", " return result" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "haššāmajim\n", "haššamajim\n", "\n", "roməmātəhū\n", "roməmatəhu\n", "\n", "hammə'orot\n", "hammə'orot\n", "\n", "rāqīa`\n", "raqi`\n", "\n", "wərūax\n", "wərux\n", "\n", "wajja`aṡ\n", "wajja`aṡ\n", "\n", "'axarèjkā\n", "'axarejka\n", "\n", "`al-ƥənē\n", "`al-pəne\n", "\n", "haxošèk\n", "haxošek\n", "\n", "ūləkāl-xajjat\n", "uləkal-xajjat\n", "\n", "ləhitətahəheahh\n", "ləhitətahəhehh\n", "\n" ] } ], "source": [ "tests = (\n", " 'HAC.@MA73JIm',\n", " 'RO75M:M@92T:HW.',\n", " 'HAM.:>ORO73T',\n", " 'R@QI73JA75XARE80Jk@',\n", " '