{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Occurrences\n", "\n", "We want to make a list of all occurrences of אֱלֹהִים, together with the function of the phrase they occur in.\n", "\n", "We define a function that, given a lexeme, produces a tab-separated file of the occurrences of that lexeme throughout the Hebrew Bible, with the following columns:\n", "\n", "1. passage label\n", "1. phrase node\n", "1. phrase text\n", "1. phrase gloss\n", "1. phrase function\n", "1. lexeme\n", "1. occurrence text\n", "1. occurrence node\n", "\n", "We apply that function to the lexeme אֱלֹהִים (in ETCBC encoding: >LHJM/) to generate two concrete output files:\n", "\n", "1. with Hebrew text represented in ETCBC consonantal transcription, for ease of importing it into Excel;\n", "2. with fully pointed Hebrew text (works best in OpenOffice or Numbers)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s This is LAF-Fabric 4.7.2\n", "API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n", "Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html\n", "\n" ] } ], "source": [ "import sys, os\n", "import collections\n", "\n", "import laf\n", "from laf.fabric import LafFabric\n", "from etcbc.preprocess import prepare\n", "fabric = LafFabric()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading the feature data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s LOADING API: please wait ... \n", " 0.00s DETAIL: COMPILING m: etcbc4b: UP TO DATE\n", " 0.00s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56\n", " 0.01s DETAIL: COMPILING a: lexicon: UP TO DATE\n", " 0.01s USING annox: lexicon DATA COMPILED AT: 2016-07-08T14-32-54\n", " 0.01s DETAIL: load main: G.node_anchor_min\n", " 0.14s DETAIL: load main: G.node_anchor_max\n", " 0.24s DETAIL: load main: G.node_sort\n", " 0.34s DETAIL: load main: G.node_sort_inv\n", " 0.83s DETAIL: load main: G.edges_from\n", " 0.95s DETAIL: load main: G.edges_to\n", " 1.14s DETAIL: load main: F.etcbc4_db_otype [node] \n", " 2.20s DETAIL: load main: F.etcbc4_ft_function [node] \n", " 2.33s DETAIL: load main: F.etcbc4_ft_lex [node] \n", " 2.54s DETAIL: load annox lexicon: F.etcbc4_lex_gloss [node] \n", " 2.80s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4b/adjectives/__log__adjectives.txt\n", " 2.83s INFO: LOADING PREPARED data: please wait ... 
\n", " 2.83s prep prep: G.node_sort\n", " 2.92s prep prep: G.node_sort_inv\n", " 3.43s prep prep: L.node_up\n", " 7.27s prep prep: L.node_down\n", " 13s prep prep: V.verses\n", " 13s prep prep: V.books_la\n", " 13s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html\n", " 17s INFO: LOADED PREPARED data\n", " 17s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK adjectives AT 2016-09-01T11-02-28\n" ] } ], "source": [ "version = '4b'\n", "API = fabric.load('etcbc{}'.format(version), 'lexicon', 'adjectives', {\n", "    \"xmlids\": {\"node\": False, \"edge\": False},\n", "    \"features\": ('''\n", "        otype \n", "        function lex\n", "        gloss\n", "    ''',\n", "    '''\n", "    '''),\n", "    \"prepare\": prepare,\n", "    \"primary\": False,\n", "}, verbose='DETAIL')\n", "exec(fabric.localnames.format(var='fabric'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Collect data\n", "\n", "We make an index from each lexeme to all its occurrences. The index takes the shape of a dictionary with the lexemes as keys and the sets of their occurrences as values. A lexeme is represented in the ETCBC transcription (we take the value of the `lex` feature), and the occurrences are represented by just their nodes (which are plain integers)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 6m 30s Making occurrence index ...\n", " 6m 31s 8777 lexemes\n" ] } ], "source": [ "occurrences = collections.defaultdict(lambda: set())\n", "# a defaultdict is needed for the case where we see a lexeme for the first time.\n", "# In that case occurrences[lexeme] does not yet exist.\n", "# The defaultdict then inserts the key lexeme with an empty set as value into the dict.\n", "inf('Making occurrence index ...')\n", "for w in F.otype.s('word'):\n", "    occurrences[F.lex.v(w)].add(w)\n", "inf('{} lexemes'.format(len(occurrences)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Retrieve the relevant bits\n", "\n", "Given the node of an occurrence, we gather all required information without much ado and assemble it into a tuple. Because we want two output formats, we define a function that takes a format parameter, `ec` (=ETCBC consonantal) or `ha` (=fully pointed Hebrew). See the ETCBC reference (follow the link in the output of the cell above that loaded the data) and look for the `T` function." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def bits(fmt, w):\n", "    # the phrase node above this word, and all the words in that phrase\n", "    p = L.u('phrase', w)\n", "    pw = list(L.d('word', p))\n", "    return (\n", "        T.passage(w),\n", "        p,\n", "        T.words(pw, fmt=fmt).replace('\\n', ' '),\n", "        ' '.join(F.gloss.v(x) for x in pw),\n", "        F.function.v(p),\n", "        F.lex.v(w),\n", "        T.words([w], fmt=fmt).replace('\\n', ' '),\n", "        w,\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating output\n", "\n", "Let us now assemble all data into the final output.\n", "We also produce a row of column headers.\n", "And we produce some statistics."
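, "\n", "Below is a minimal sketch (not executed in this notebook) of how the header line and a single data row are put together, using the `bits` function above and the `fields` and `row_template` definitions in the next cell; `some_word_node` stands for a hypothetical word node:\n", "\n", "```python\n", "header = '\\t'.join(fields) + '\\n'                       # tab-separated column names\n", "row = row_template.format(*bits('ec', some_word_node))  # one occurrence, consonantal text\n", "```"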
] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "fields = '''\n", "    passage\n", "    phrase_node\n", "    phrase_text\n", "    phrase_gloss\n", "    phrase_function\n", "    lexeme\n", "    occ_text\n", "    occ_node\n", "'''.strip().split()\n", "nfields = len(fields)\n", "row_template = ('{}\\t' * (nfields - 1)) + '{}\\n'\n", "of_path_template = 'occurrences_{}.{}.csv'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we define the function that writes the file, given a lexeme and a format, and a function that produces statistics for a given lexeme." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def lex_file_name(lexeme):\n", "    # in order to use the lexeme in a file name, we replace < > / [ = by harmless characters\n", "    return lexeme.\\\n", "        replace('/', 's').\\\n", "        replace('[', 'v').\\\n", "        replace('=', 'x').\\\n", "        replace('<', 'o').\\\n", "        replace('>', 'a')\n", "\n", "def lex_info(lexeme, fmt):\n", "    file_lex = lex_file_name(lexeme)\n", "    file_name = of_path_template.format(file_lex, fmt)\n", "    of = open(file_name, 'w')\n", "    of.write('{}\\n'.format('\\t'.join(fields)))\n", "    if lexeme not in occurrences:\n", "        msg('There is no lexeme \"{}\"'.format(lexeme))\n", "        occs = []\n", "    else:\n", "        occs = sorted(occurrences[lexeme], key=NK)\n", "        # sorted turns the set into a list; the order is given by the key parameter.\n", "        # This is the function NK (see the ETCBC reference): it orders nodes\n", "        # according to where their associated text occurs in the Bible.\n", "    for w in occs:\n", "        of.write(row_template.format(*bits(fmt, w)))\n", "        # bits yields a tuple of values; the * unpacks this tuple into separate arguments.\n", "    of.close()\n", "    inf('Written {} lines to {}'.format(len(occs) + 1, file_name))\n", "\n", "def show_stats(lexeme):\n", "    # we produce an overview of the distribution of the occurrences over the books;\n", "    # book names are shown in Swahili and in English\n", "    book_dist = collections.Counter()\n", "    if lexeme not in occurrences:\n", "        msg('There is no lexeme \"{}\"'.format(lexeme))\n", "        occs = []\n", "    else:\n", "        occs = sorted(occurrences[lexeme], key=NK)\n", "    for w in occs:\n", "        book_node = L.u('book', w)\n", "        book_name_sw = T.book_name(book_node, lang='sw')\n", "        book_name = T.book_name(book_node)\n", "        book_dist['{:<30} = {}'.format(book_name_sw, book_name)] += 1\n", "    # we sort the results by frequency\n", "    total = 0\n", "    for (b, n) in sorted(book_dist.items(), key=lambda x: (-x[1], x[0])):\n", "        print('{:<10} has {:>5} occurrences in {}'.format(lexeme, n, b))\n", "        total += n\n", "    print('{:<10} has {:>5} occurrences in {}'.format(lexeme, total, 'the whole Bible'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we produce results for lexeme `>LHJM/` and formats `ec` and `ha`."
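, "\n", "As a quick sketch (not executed here) of the file naming: `lex_file_name` replaces `>` by `a` and `/` by `s`, so we expect the calls below to write `occurrences_aLHJMs.ec.csv` and `occurrences_aLHJMs.ha.csv`:\n", "\n", "```python\n", "lex_file_name('>LHJM/')                  # -> 'aLHJMs'\n", "of_path_template.format('aLHJMs', 'ec')  # -> 'occurrences_aLHJMs.ec.csv'\n", "```"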
] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">LHJM/ has 374 occurrences in Kumbukumbu_la_Torati = Deuteronomy\n", ">LHJM/ has 365 occurrences in Zaburi = Psalms\n", ">LHJM/ has 219 occurrences in Mwanzo = Genesis\n", ">LHJM/ has 203 occurrences in 2_Mambo_ya_Nyakati = 2_Chronicles\n", ">LHJM/ has 145 occurrences in Yeremia = Jeremiah\n", ">LHJM/ has 139 occurrences in Kutoka = Exodus\n", ">LHJM/ has 118 occurrences in 1_Mambo_ya_Nyakati = 1_Chronicles\n", ">LHJM/ has 107 occurrences in 1_Wafalme = 1_Kings\n", ">LHJM/ has 100 occurrences in 1_Samweli = 1_Samuel\n", ">LHJM/ has 98 occurrences in 2_Wafalme = 2_Kings\n", ">LHJM/ has 94 occurrences in Isaya = Isaiah\n", ">LHJM/ has 76 occurrences in Yoshua = Joshua\n", ">LHJM/ has 73 occurrences in Waamuzi = Judges\n", ">LHJM/ has 70 occurrences in Nehemia = Nehemiah\n", ">LHJM/ has 55 occurrences in Ezra = Ezra\n", ">LHJM/ has 54 occurrences in 2_Samweli = 2_Samuel\n", ">LHJM/ has 53 occurrences in Mambo_ya_Walawi = Leviticus\n", ">LHJM/ has 40 occurrences in Mhubiri = Ecclesiastes\n", ">LHJM/ has 36 occurrences in Ezekieli = Ezekiel\n", ">LHJM/ has 27 occurrences in Hesabu = Numbers\n", ">LHJM/ has 26 occurrences in Hosea = Hosea\n", ">LHJM/ has 22 occurrences in Danieli = Daniel\n", ">LHJM/ has 17 occurrences in Ayubu = Job\n", ">LHJM/ has 16 occurrences in Yona = Jonah\n", ">LHJM/ has 14 occurrences in Amosi = Amos\n", ">LHJM/ has 11 occurrences in Mika = Micah\n", ">LHJM/ has 11 occurrences in Yoeli = Joel\n", ">LHJM/ has 11 occurrences in Zekaria = Zechariah\n", ">LHJM/ has 7 occurrences in Malaki = Malachi\n", ">LHJM/ has 5 occurrences in Mithali = Proverbs\n", ">LHJM/ has 5 occurrences in Sefania = Zephaniah\n", ">LHJM/ has 4 occurrences in Ruthi = Ruth\n", ">LHJM/ has 3 occurrences in Hagai = Haggai\n", ">LHJM/ has 2 occurrences in Habakuki = Habakkuk\n", ">LHJM/ has 1 occurrences in Nahumu = Nahum\n", ">LHJM/ has 2601 occurrences in the whole Bible\n", " 1h 03m 57s Written 2602 lines to occurrences_aLHJMs.ec.csv\n", " 1h 03m 58s Written 2602 lines to occurrences_aLHJMs.ha.csv\n" ] } ], "source": [ "lexeme = '>LHJM/'\n", "show_stats(lexeme)\n", "lex_info(lexeme, 'ec')\n", "lex_info(lexeme, 'ha')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Results\n", "[ETCBC consonantal](occurrences_aLHJMs.ec.csv)\n", "and\n", "[fully pointed Hebrew](occurrences_aLHJMs.ha.csv).\n", "\n", "Screenshot made in the Numbers program:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "passage\tphrase_node\tphrase_text\tphrase_gloss\tphrase_function\tlexeme\tocc_text\tocc_node\n", "Genesis 1:1\t605135\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t3\n", "Genesis 1:2\t605145\tRWX >LHJM \twind god(s)\tSubj\t>LHJM/\t>LHJM \t25\n", "Genesis 1:3\t605150\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t33\n", "Genesis 1:4\t605158\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t41\n", "Genesis 1:4\t605164\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t49\n", "Genesis 1:5\t605168\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t59\n", "Genesis 1:6\t605184\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t80\n", "Genesis 1:7\t605194\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t96\n", "Genesis 1:8\t605208\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t126\n", "Genesis 1:9\t605220\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t141\n", "Genesis 1:10\t605232\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t161\n", 
"Genesis 1:10\t605241\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t175\n", "Genesis 1:11\t605246\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t180\n", "Genesis 1:12\t605277\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t224\n", "Genesis 1:14\t605289\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t237\n", "Genesis 1:16\t605309\t>LHJM \tgod(s)\tSubj\t>LHJM/\t>LHJM \t283\n", "Genesis 1:17\n" ] } ], "source": [ "print(open(of_path_template.format(lex_file_name(lexeme), 'ec')).read()[0:1000])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }