{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Accented Units\n", "\n", "Some words in the Hebrew text are contiguous with preceding or following words: there is no white space to separate them, only empty space or a maqaf (hyphen). Such a sequence of adjacent words we call an *accented unit*.\n", "\n", "How do you find, given a word occurrence, the complete accented unit it belongs to?" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s This is LAF-Fabric 4.8.2\n", "API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n", "Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html\n", "\n" ] } ], "source": [ "import sys, os\n", "\n", "import laf\n", "from laf.fabric import LafFabric\n", "#from etcbc.preprocess import prepare\n", "fabric = LafFabric()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading the feature data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s LOADING API: please wait ... \n", " 0.00s DETAIL: COMPILING m: etcbc4b: UP TO DATE\n", " 0.00s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56\n", " 0.00s DETAIL: COMPILING a: lexicon: UP TO DATE\n", " 0.01s USING annox: lexicon DATA COMPILED AT: 2016-07-08T14-32-54\n", " 0.01s DETAIL: COMPILING a: para: UP TO DATE\n", " 0.01s USING annox: para DATA COMPILED AT: 2016-07-08T14-38-37\n", " 0.02s DETAIL: load main: G.node_anchor_min\n", " 0.10s DETAIL: load main: G.node_anchor_max\n", " 0.21s DETAIL: load main: G.node_sort\n", " 0.28s DETAIL: load main: G.node_sort_inv\n", " 0.68s DETAIL: load main: G.edges_from\n", " 0.74s DETAIL: load main: G.edges_to\n", " 0.80s DETAIL: load main: F.etcbc4_db_monads [node] \n", " 1.48s DETAIL: load main: F.etcbc4_db_otype [node] \n", " 2.71s DETAIL: load main: F.etcbc4_ft_g_word_utf8 [node] \n", " 2.99s DETAIL: load main: F.etcbc4_ft_trailer_utf8 [node] \n", " 3.11s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4b/paragraphs/__log__paragraphs.txt\n", " 3.11s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon, para FOR TASK paragraphs AT 2016-09-26T17-56-23\n" ] } ], "source": [ "version = '4b'\n", "API = fabric.load('etcbc{}'.format(version), 'lexicon,para', 'paragraphs', {\n", " \"xmlids\": {\"node\": False, \"edge\": False},\n", " \"features\": ('''\n", " otype monads\n", " g_word_utf8 trailer_utf8\n", " ''',\n", " '''\n", " '''),\n", "# \"prepare\": prepare,\n", " \"primary\": False,\n", "}, verbose='DETAIL')\n", "exec(fabric.localnames.format(var='fabric'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Method\n", "\n", "We use `trailer_utf8` as criterion whether a word and the following word are contiguous. If it is the empty string or a maqaf, we conclude that the word in question and the next one are part of the same accented unit.\n", "\n", "It is not straightforward in LAF-Fabric how to proceed from a word node the the node of the next of previous word.\n", "\n", "A solution is then to walk through all the words in text order, and make a mapping from words to accented units.\n", "\n", "If we store that mapping, we can easily find the *au* of any word node that we encounter.\n", "\n", "# Making an index" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 3.05s Compiling index of accented units ...\n", " 5.20s Assembled 426568 words into 262497 accented units\n" ] } ], "source": [ "inf('Compiling index of accented units ...')\n", "word2au = {} # the mapping from word node to accent unit\n", "aus = set() # only needed to count the total number of accented units\n", "glue = {'', '־'} # the interword material that continues the current au\n", "current_au = []\n", "for w in F.otype.s('word'):\n", " current_au.append(w)\n", " word2au[w] = current_au\n", " if F.trailer_utf8.v(w) not in glue: # move to a new au\n", " aus.add(tuple(current_au))\n", " current_au = []\n", "if current_au: aus.add(tuple(current_au))\n", "inf('Assembled {} words into {} accented units'.format(\n", " len(word2au.keys()),\n", " len(aus),\n", ")) \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Result\n", "\n", "We display the text of the first 11 verses in the Bible and mark the accented units with a a tuple of the monad numbers of their words." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "text = ''\n", "verse = 0\n", "for n in NN():\n", " otype = F.otype.v(n)\n", " if otype == 'verse':\n", " verse += 1\n", " if verse > 11: break\n", " text += '\\nGenesis 1:{}\\n'.format(verse)\n", " prev_au = None\n", " elif otype == 'word':\n", " this_au = word2au[n]\n", " if prev_au != None and this_au is not prev_au:\n", " text += ' ({}) '.format(','.join(F.monads.v(x) for x in prev_au))\n", " prev_au = this_au\n", " text += F.g_word_utf8.v(n)+F.trailer_utf8.v(n)\n", "if prev_au:\n", " text += ' ({}) '.format(','.join(F.monads.v(x) for x in prev_au))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Genesis 1:1\n", "בְּרֵאשִׁ֖ית (1,2) בָּרָ֣א (3) אֱלֹהִ֑ים (4) אֵ֥ת (5) הַשָּׁמַ֖יִם (6,7) וְאֵ֥ת (8,9) הָאָֽרֶץ׃\n", "\n", "Genesis 1:2\n", "וְהָאָ֗רֶץ (12,13,14) הָיְתָ֥ה (15) תֹ֨הוּ֙ (16) וָבֹ֔הוּ (17,18) וְחֹ֖שֶׁךְ (19,20) עַל־פְּנֵ֣י (21,22) תְהֹ֑ום (23) וְר֣וּחַ (24,25) אֱלֹהִ֔ים (26) מְרַחֶ֖פֶת (27) עַל־פְּנֵ֥י (28,29) הַמָּֽיִם׃\n", "\n", "Genesis 1:3\n", "וַיֹּ֥אמֶר (32,33) אֱלֹהִ֖ים (34) יְהִ֣י (35) אֹ֑ור (36) וַֽיְהִי־אֹֽור׃\n", "\n", "Genesis 1:4\n", "וַיַּ֧רְא (40,41) אֱלֹהִ֛ים (42) אֶת־הָאֹ֖ור (43,44,45) כִּי־טֹ֑וב (46,47) וַיַּבְדֵּ֣ל (48,49) אֱלֹהִ֔ים (50) בֵּ֥ין (51) הָאֹ֖ור (52,53) וּבֵ֥ין (54,55) הַחֹֽשֶׁךְ׃\n", "\n", "Genesis 1:5\n", "וַיִּקְרָ֨א (58,59) אֱלֹהִ֤ים ׀ (60) לָאֹור֙ (61,63) יֹ֔ום (62,64) וְלַחֹ֖שֶׁךְ (65,66,68) קָ֣רָא (67,69) לָ֑יְלָה (70) וַֽיְהִי־עֶ֥רֶב (71,72,73) וַֽיְהִי־בֹ֖קֶר (74,75,76) יֹ֥ום (77) אֶחָֽד׃ פ \n", "\n", "Genesis 1:6\n", "וַיֹּ֣אמֶר (79,80) אֱלֹהִ֔ים (81) יְהִ֥י (82) רָקִ֖יעַ (83) בְּתֹ֣וךְ (84,85) הַמָּ֑יִם (86,87) וִיהִ֣י (88,89) מַבְדִּ֔יל (90) בֵּ֥ין (91) מַ֖יִם (92) לָמָֽיִם׃\n", "\n", "Genesis 1:7\n", "וַיַּ֣עַשׂ (95,96) אֱלֹהִים֮ (97) אֶת־הָרָקִיעַ֒ (98,99,100) וַיַּבְדֵּ֗ל (101,102) בֵּ֤ין (103) הַמַּ֨יִם֙ (104,105) אֲשֶׁר֙ (106) מִתַּ֣חַת (107,108) לָרָקִ֔יעַ (109,111) וּבֵ֣ין (110,112,113) הַמַּ֔יִם (114,115) אֲשֶׁ֖ר (116) מֵעַ֣ל (117,118) לָרָקִ֑יעַ (119,121) וַֽיְהִי־כֵֽן׃\n", "\n", "Genesis 1:8\n", "וַיִּקְרָ֧א (125,126) אֱלֹהִ֛ים (127) לָֽרָקִ֖יעַ (128,130) שָׁמָ֑יִם (129,131) וַֽיְהִי־עֶ֥רֶב (132,133,134) וַֽיְהִי־בֹ֖קֶר (135,136,137) יֹ֥ום (138) שֵׁנִֽי׃ פ \n", "\n", "Genesis 1:9\n", "וַיֹּ֣אמֶר (140,141) אֱלֹהִ֗ים (142) יִקָּו֨וּ (143) הַמַּ֜יִם (144,145) מִתַּ֤חַת (146,147) הַשָּׁמַ֨יִם֙ (148,149) אֶל־מָקֹ֣ום (150,151) אֶחָ֔ד (152) וְתֵרָאֶ֖ה (153,154) הַיַּבָּשָׁ֑ה (155,156) וַֽיְהִי־כֵֽן׃\n", "\n", "Genesis 1:10\n", "וַיִּקְרָ֨א (160,161) אֱלֹהִ֤ים ׀ (162) לַיַּבָּשָׁה֙ (163,165) אֶ֔רֶץ (164,166) וּלְמִקְוֵ֥ה (167,168,169) הַמַּ֖יִם (170,171) קָרָ֣א (172) יַמִּ֑ים (173) וַיַּ֥רְא (174,175) אֱלֹהִ֖ים (176) כִּי־טֹֽוב׃\n", "\n", "Genesis 1:11\n", "וַיֹּ֣אמֶר (179,180) אֱלֹהִ֗ים (181) תַּֽדְשֵׁ֤א (182) הָאָ֨רֶץ֙ (183,184) דֶּ֔שֶׁא (185) עֵ֚שֶׂב (186) מַזְרִ֣יעַ (187) זֶ֔רַע (188) עֵ֣ץ (189) פְּרִ֞י (190) עֹ֤שֶׂה (191) פְּרִי֙ (192) לְמִינֹ֔ו (193,194) אֲשֶׁ֥ר (195) זַרְעֹו־בֹ֖ו (196,197) עַל־הָאָ֑רֶץ (198,199,200) וַֽיְהִי־כֵֽן׃\n", " (201,202,203) \n" ] } ], "source": [ "print(text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }