{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Rare lexemes in parallels\n", "\n", "Joint work of Kenneth Bergland, Martijn Naaijer and Dirk Roorda.\n", "\n", "We try to find rare combinations of words and see how they occur in Thora books and in the Prophets." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s This is LAF-Fabric 4.5.5\n", "API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n", "Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html\n", "\n" ] } ], "source": [ "import sys,os\n", "import collections, difflib\n", "from IPython.display import HTML, display_pretty, display_html\n", "from difflib import SequenceMatcher\n", "\n", "import laf\n", "from laf.fabric import LafFabric\n", "from etcbc.preprocess import prepare\n", "fabric = LafFabric()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "source = 'etcbc'\n", "version = '4b'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s LOADING API: please wait ... \n", " 0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56\n", " 0.00s INFO: USING DATA COMPILED AT: 2015-11-03T06-44-21\n", " 4.17s LOGFILE=/Users/dirk/SURFdrive/laf-fabric-output/etcbc4b/rare_lexemes/__log__rare_lexemes.txt\n", " 13s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html\n", " 0.00s LOADING API with EXTRAs: please wait ... 
\n", " 0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56\n", " 0.00s INFO: USING DATA COMPILED AT: 2015-11-03T06-44-21\n", " 0.00s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK rare_lexemes AT 2015-12-17T10-55-18\n", " 0.00s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK rare_lexemes AT 2015-12-17T10-55-18\n" ] } ], "source": [ "API = fabric.load(source+version, 'lexicon', 'rare_lexemes', {\n", " \"xmlids\": {\"node\": False, \"edge\": False},\n", " \"features\": ('''\n", " otype\n", " language lex g_cons g_word_utf8 trailer_utf8 phono phono_sep\n", " sp\n", " book chapter verse label number\n", " ''',''),\n", " \"prepare\": prepare,\n", " \"primary\": False,\n", "}, verbose='NORMAL')\n", "exec(fabric.localnames.format(var='fabric'))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "book_classes = dict(\n", " legal = set('''\n", " Exodus Leviticus Deuteronomium\n", " '''.strip().split()),\n", " prophets = set('''\n", " Jesaia Jeremia\n", " Ezechiel Hosea\n", " Joel Amos Obadia Jona Micha Nahum Habakuk Zephania Haggai Sacharia Maleachi\n", " '''.strip().split())\n", ")\n", "book_classes_index = {}\n", "for (cl, books) in book_classes.items():\n", " for book in books:\n", " book_classes_index[book] = cl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Leave out non-content words\n", "\n", "It is a bit sorry to have all those bigrams involving an article, for example.\n", "So, as an option, we leave out the non content words, being words with a certain part-of-speech.\n", "\n", " X art\tarticle\n", " verb\tverb\n", " subs\tnoun\n", " nmpr\tproper noun\n", " advb\tadverb\n", " X prep\tpreposition\n", " X conj\tconjunction\n", " X prps\tpersonal pronoun\n", " X prde\tdemonstrative pronoun\n", " X prin\tinterrogative pronoun\n", " X intj\tinterjection\n", " X nega\tnegative particle\n", " X inrg\tinterrogative particle\n", " adjv\tadjective" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ONLY_SP = set('''\n", " verb subs nmpr advb adjv\n", "'''.strip().split())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Gather bi- and tri-grams\n", "\n", "We create a big set of bi-tri-grams and count their frequencies.\n", "We only consider bi-tri-grams that of which all members are part of the same clause.\n", "\n", "We want to know the frequencies in the thora, in the prophets, and in the whole bible.\n", "\n", "## Get all the grams" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 11s Collecting all <= 4-grams\n", " 16s Done\n" ] } ], "source": [ "grams = collections.defaultdict(lambda: collections.defaultdict(lambda: collections.Counter()))\n", "ng = 4 # the max number of members in the n-gram\n", "msg('Collecting all <= {}-grams'.format(ng))\n", "for c in F.otype.s('clause'):\n", " bk = F.book.v(L.u('book', c))\n", " words = list(L.d('word', c))\n", " lenw = len(words)\n", " for n in range(1, ng + 1):\n", " for i in range(lenw - n):\n", " this_gram = '-'.join(F.lex.v(w) if F.sp.v(w) in ONLY_SP else '*' for w in (words[j] for j in range(i, n)))\n", " grams[n][bk][this_gram] += 1\n", "msg('Done')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Count and rank the grams\n", "\n", "Now compute the counts for the indicated groups of books." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 18s Counting grams\n", " 19s Done\n" ] } ], "source": [ "freqs_by_g = collections.defaultdict(lambda: collections.defaultdict(lambda: collections.Counter()))\n", "freqs_by_c = collections.defaultdict(lambda: collections.defaultdict(lambda: collections.Counter()))\n", "\n", "msg('Counting grams')\n", "for n in grams:\n", " ngrams = grams[n]\n", " for bk in ngrams:\n", " cl = book_classes_index.get(bk, 'bible')\n", " these_grams = ngrams[bk]\n", " for g in these_grams:\n", " freqs_by_g[n][g][cl] += these_grams[g]\n", " freqs_by_c[n][cl][g] += these_grams[g]\n", "msg('Done')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Filter the grams\n", "\n", "We set a threshold: grams that occur at most that many times in a collection of books count as *rare* there, all others as *abundant*.\n", "\n", "For every n and for a range of thresholds we then count how many grams are rare in both collections (rr), rare in the legal books but abundant in the prophets (ra), abundant in the legal books but rare in the prophets (ar), or abundant in both (aa)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 21s Filtering\n", " 24s Done\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "RARE means <= 1\n", "\t1-grams: rr ra ar aa = 1866 423 88 141\n", "\t2-grams: rr ra ar aa = 10124 1276 430 700\n", "\t3-grams: rr ra ar aa = 29657 1933 1116 723\n", "\t4-grams: rr ra ar aa = 42812 1541 1393 458\n", "RARE means <= 2\n", "\t1-grams: rr ra ar aa = 2120 262 43 93\n", "\t2-grams: rr ra ar aa = 10997 808 241 484\n", "\t3-grams: rr ra ar aa = 31502 927 535 465\n", "\t4-grams: rr ra ar aa = 44768 603 583 250\n", "RARE means <= 3\n", "\t1-grams: rr ra ar aa = 2232 181 28 77\n", "\t2-grams: rr ra ar aa = 11403 566 178 383\n", "\t3-grams: rr ra ar aa = 32193 555 346 335\n", "\t4-grams: rr ra ar aa = 45384 309 351 160\n", "RARE means <= 4\n", "\t1-grams: rr ra ar aa = 2307 124 24 63\n", "\t2-grams: rr ra ar aa = 11643 433 135 319\n", "\t3-grams: rr ra ar aa = 32525 383 262 259\n", "\t4-grams: rr ra ar aa = 45616 215 246 127\n", "RARE means <= 5\n", "\t1-grams: rr ra ar aa = 2352 96 18 52\n", "\t2-grams: rr ra ar aa = 11803 334 115 278\n", "\t3-grams: rr ra ar aa = 32721 295 203 210\n", "\t4-grams: rr ra ar aa = 45781 153 173 97\n", "RARE means <= 6\n", "\t1-grams: rr ra ar aa = 2375 81 20 42\n", "\t2-grams: rr ra ar aa = 11903 283 97 247\n", "\t3-grams: rr ra ar aa = 32855 245 152 177\n", "\t4-grams: rr ra ar aa = 45879 123 116 86\n", "RARE means <= 7\n", "\t1-grams: rr ra ar aa = 2386 77 21 34\n", "\t2-grams: rr ra ar aa = 11974 261 82 213\n", "\t3-grams: rr ra ar aa = 32950 203 121 155\n", "\t4-grams: rr ra ar aa = 45929 101 101 73\n", "RARE means <= 8\n", "\t1-grams: rr ra ar aa = 2407 64 19 28\n", "\t2-grams: rr ra ar aa = 12029 230 82 189\n", "\t3-grams: rr ra ar aa = 33018 168 107 136\n", "\t4-grams: rr ra ar aa = 45980 76 82 66\n", "RARE means <= 9\n", "\t1-grams: rr ra ar aa = 2423 52 17 26\n", "\t2-grams: rr ra ar aa = 12076 209 76 169\n", "\t3-grams: rr ra ar aa = 33067 147 92 123\n", "\t4-grams: rr ra ar aa = 46016 58 67 63\n" ] } ], "source": [ "class_a = 'legal'\n", "class_b = 'prophets'\n", "\n", "results = collections.defaultdict(lambda: collections.defaultdict(lambda: collections.Counter()))\n", "\n", "msg('Filtering')\n", "for RARE in range(1, 10):\n", " for n in sorted(freqs_by_g):\n", " nfreqs_by_g = freqs_by_g[n]\n", " for g in nfreqs_by_g:\n",
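 " # classify the gram as rare or abundant in each class:\n", " # 'rr' = rare in both, 'ra' = rare in legal / abundant in prophets,\n", " # 'ar' = the reverse, 'aa' = abundant in both\n",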
 " g_info = nfreqs_by_g[g]\n", " nc = dict((cl, g_info.get(cl, 0)) for cl in book_classes)\n", " if nc[class_a] <= RARE and nc[class_b] <= RARE: results[RARE][n]['rr'] += 1\n", " elif nc[class_a] <= RARE and nc[class_b] > RARE: results[RARE][n]['ra'] += 1\n", " elif nc[class_a] > RARE and nc[class_b] <= RARE: results[RARE][n]['ar'] += 1\n", " else: results[RARE][n]['aa'] += 1\n", "msg('Done')\n", "\n", "for RARE in sorted(results):\n", " print('RARE means <= {}'.format(RARE))\n", " for n in range(1, ng + 1):\n", " print('\\t{}-grams: rr ra ar aa = {}'.format(\n", " n,\n", " ' '.join(\n", " '{:>5}'.format(results[RARE].get(n, {}).get(y, 0)) for y in ('rr', 'ra', 'ar', 'aa')\n", " ),\n", " ))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Verse index" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "21m 38s Making a mapping between a passage specification and a verse node\n", "21m 39s 23213 verses\n" ] } ], "source": [ "msg(\"Making a mapping between a passage specification and a verse node\")\n", "passage2vnode = {}\n", "for vs in F.otype.s('verse'):\n", " lab = (F.book.v(vs), F.chapter.v(vs), F.verse.v(vs))\n", " passage2vnode[lab] = vs\n", "msg(\"{} verses\".format(len(passage2vnode)))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Ways forward\n", "\n", "1. restrict to indicated chapters in each of the books (the ones that correspond to the agreed genre of *legal inside the Torah*)\n", "2. compute unique parallels between different collections in order to get a baseline of how many to expect.\n", "3. alternatively, use parallel passages to find parallel phrases and then focus on things that are rare.\n", "\n", "# Other indicators\n", "\n", "Look for *update lexemes*, where in parallel phrases one lexeme is different, for example measure nouns.\n", "See for example Ezek 45:10 - Lev 19:36 - Deut 25:15.\n", "\n", "This is a case of parallel language use, not strictly a *parallel passage*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Difference lexemes\n", "\n", "We are going to list the lexemes that make a difference between parallel passages.\n", "For every set of parallel passages we list the lexemes that do not occur in their intersection.\n", "\n", "We print a list of all groups of parallel verses, in which the lexemes that do not occur in all members of the group are highlighted.\n", "\n", "Every lexeme that occurs in some member of a parallel group, but not in the intersection of the lexemes of that group, is entered into a dictionary, keyed first by the lexeme and then by the group number. Each entry records the passages in which the lexeme does occur and the passages in which it does not."
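] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a minimal sketch of that set logic on fabricated data. The name verse_lexemes is a hypothetical stand-in for the real per-verse lexeme sets {F.lex.v(w) for w in L.d('word', v)}, and the lexeme strings are invented." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sketch: difference lexemes of one parallel group (fabricated data)\n", "verse_lexemes = {\n", " 'verse A': {'lex1', 'lex2', 'lex3'},\n", " 'verse B': {'lex1', 'lex2', 'lex4'},\n", "}\n", "\n", "# the intersection: lexemes shared by all members of the group\n", "common = set.intersection(*verse_lexemes.values())\n", "\n", "# difference lexemes: present somewhere in the group but not everywhere\n", "for (verse, lexs) in sorted(verse_lexemes.items()):\n", " print(verse, 'difference lexemes:', sorted(lexs - common))\n", "# verse A -> ['lex3'], verse B -> ['lex4']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cells below apply exactly this logic to the groups of parallel verses from the crossref database."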
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "21m 43s /Users/dirk/SURFdrive/current/demos/github/shebanq/static/docs/tools/parallel/files/crossrefdb.csv\n" ] } ], "source": [ "CROSSREF_TOOL = 'parallel'\n", "CROSSREF_DB_FILE = 'crossrefdb.csv'\n", "SHEBANQ_PATH = os.path.abspath('{}/../../../shebanq'.format(os.getcwd()))\n", "CROSSREF_DB_DIR = '{}/static/docs/tools/{}/files'.format(SHEBANQ_PATH, CROSSREF_TOOL)\n", "CROSSREF_DB_PATH = '{}/{}'.format(CROSSREF_DB_DIR, CROSSREF_DB_FILE)\n", "msg(CROSSREF_DB_PATH)\n", "PRETTY_PAIRS = 'pairs.html'\n", "DIFF_LEX = 'difflex.csv'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Results" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "<p><a href=\"https://rawgit.com/etcbc/laf-fabric-nbs/master/lingvar/pairs.html\">parallel_pairs</a></p>\n", "<p><a href=\"https://rawgit.com/etcbc/laf-fabric-nbs/master/lingvar/difflex.csv\">difference lexemes</a></p>\n" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# NOTE: the anchor markup below is a reconstruction; the original tags were lost\n", "loc_tpl = 'https://rawgit.com/etcbc/laf-fabric-nbs/master/lingvar/{}'\n", "HTML('''\n", "<p><a href=\"{}\">parallel_pairs</a></p>\n", "<p><a href=\"{}\">difference lexemes</a></p>\n", "'''.format(loc_tpl.format(PRETTY_PAIRS), loc_tpl.format(DIFF_LEX)))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "21m 52s Reading crossrefs database\n", "21m 52s 14354 entries read\n", "21m 52s Gathered 1235 sets of parallel verses\n" ] } ], "source": [ "msg('Reading crossrefs database')\n", "n = 0\n", "parallel_pairs_proto = collections.defaultdict(lambda: set())\n", "group_index = {}\n", "group_number = 0\n", "with open(CROSSREF_DB_PATH) as f:\n", " for line in f:\n", " n += 1\n", " if n == 1: continue # skip the header line\n", " (bkx, chx, vsx, bky, chy, vsy, rd) = line.rstrip('\\n').split('\\t')\n", " vx = passage2vnode[(bkx, chx, vsx)]\n", " vy = passage2vnode[(bky, chy, vsy)]\n", " gx = group_index.get(vx, None)\n", " gy = group_index.get(vy, None)\n", " if gx is None and gy is None:\n", " group_number += 1\n", " group_index[vx] = group_number\n", " group_index[vy] = group_number\n", " elif gx is None:\n", " group_index[vx] = gy\n", " elif gy is None:\n", " group_index[vy] = gx\n", " elif gx != gy:\n", " update = [x for x in group_index if group_index[x] == gx]\n", " for x in update: group_index[x] = gy\n", "msg('{} entries read'.format(n))\n", "\n", "for (x, n) in group_index.items(): parallel_pairs_proto[n].add(x)\n", "parallel_pairs = [sorted(parallel_pairs_proto[n]) for n in parallel_pairs_proto]\n", "parallel_pairs_proto = None\n", "msg('Gathered {} sets of parallel verses'.format(len(parallel_pairs)))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "21m 59s Building a difference lexeme table\n", "22m 00s 1235 groups in table\n" ] } ], "source": [ "msg('Building a difference lexeme table')\n", "intersections = []\n", "unions = []\n", "for p in parallel_pairs:\n", " clex = None\n", " ulex = collections.defaultdict(lambda: set())\n", " for x in p:\n", " xlex = {F.lex.v(w) for w in L.d('word', x)}\n", " clex = xlex if clex is None else clex & xlex\n", " for lex in xlex:\n", " ulex[lex].add(x)\n", " intersections.append(clex)\n", " unions.append(ulex)\n", "msg('{} groups in table'.format(len(intersections)))" ] },
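{ "cell_type": "markdown", "metadata": {}, "source": [ "The crossref entries were merged into verse groups above by an ad-hoc union-find: every pair either founds a new group, joins an existing one, or merges two groups into one. A self-contained sketch of that merge logic on fabricated pairs (toy_pairs and gi are hypothetical names):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sketch of the group-merging (union-find) logic on fabricated verse pairs\n", "toy_pairs = [(1, 2), (3, 4), (2, 3), (5, 6)]\n", "gi = {} # member -> group number\n", "gn = 0\n", "for (x, y) in toy_pairs:\n", " (gx, gy) = (gi.get(x), gi.get(y))\n", " if gx is None and gy is None:\n", " gn += 1 # neither member seen before: found a new group\n", " gi[x] = gi[y] = gn\n", " elif gx is None:\n", " gi[x] = gy # x joins the group of y\n", " elif gy is None:\n", " gi[y] = gx # y joins the group of x\n", " elif gx != gy:\n", " for m in [m for m in gi if gi[m] == gx]:\n", " gi[m] = gy # merge the whole group of x into that of y\n", "print(gi) # 1, 2, 3, 4 end up in one group; 5 and 6 in another" ] },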
"\n", "'''\n", "\n", "html_file_tpl = '''\n", "\n", "\n", "{}\n", "{}\n", "\n", "\n", "

Table of groups

\n", "{}\n", "

Groups

\n", "{}\n", "\n", "'''" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# we want to sort passages in such a way that the verses in 1 Kings 19-26 are put before anything else\n", "\n", "def print_chunk(i, clex, members):\n", " result = ['''\n", "Group {i}\n", "\n", "'''.format(i=i)]\n", " for v in members:\n", " lab = '{} {}:{}'.format(F.book.v(v), F.chapter.v(v), F.verse.v(v))\n", " text = ''.join(\n", " (F.g_word_utf8.v(w)\\\n", " if F.lex.v(w) in clex \\\n", " else '{}'.format(F.g_word_utf8.v(w))\n", " )+F.trailer_utf8.v(w) for w in L.d('word', v)\n", " )\n", " result.append('''\n", "\n", "'''.format(\n", " rb=lab,\n", " rl=text,\n", " ))\n", " result.append('''
{rb}{rl}
\n", "''')\n", " return ''.join(result)\n", "\n", "def index_chunk(i, clex, members):\n", " verse_labels = ', '.join('{} {}:{}'.format(F.book.v(v), F.chapter.v(v), F.verse.v(v)) for v in members)\n", " return '

{i} {vl}

\\n'.format(\n", " vl = verse_labels, i=i,\n", " )" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "22m 21s Pretty printing the table\n", "22m 21s Done\n" ] } ], "source": [ "msg('Pretty printing the table')\n", "allgeni_html = []\n", "allgenh_html = []\n", "for (i, p) in enumerate(parallel_pairs):\n", " clex = intersections[i]\n", " allgeni_html.append(index_chunk(i, clex, p))\n", " allgenh_html.append(print_chunk(i, clex, p))\n", "outf = open(PRETTY_PAIRS, 'w')\n", "outf.write(html_file_tpl.format(\n", " 'Pairs',\n", " css,\n", " ''.join(allgeni_html),\n", " ''.join(allgenh_html),\n", "))\n", "outf.close()\n", "msg('Done')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The table with difference lexemes" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "diff_pairs = collections.defaultdict(lambda: collections.defaultdict(lambda: {}))\n", "for (i, p) in enumerate(parallel_pairs):\n", " clex = intersections[i]\n", " ulex = unions[i]\n", " for lex in ulex:\n", " if lex in clex: continue\n", " diff_pairs[lex][i]['has'] = ulex[lex]\n", " diff_pairs[lex][i]['hasnot'] = set(p) - ulex[lex]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "outf = open(DIFF_LEX, 'w')\n", "outf.write('lexeme\\tgroup\\tsize\\thas?\\tbook\\tchapter\\tverse\\n')\n", "for lex in sorted(diff_pairs):\n", " for i in sorted(diff_pairs[lex]):\n", " lnp = len(diff_pairs[lex][i]['has']) + len(diff_pairs[lex][i]['hasnot'])\n", " for v in sorted(diff_pairs[lex][i]['has']):\n", " outf.write('{}\\t{}\\t{}\\t+\\t{}\\t{}\\t{}\\n'.format(\n", " lex, i, lnp, F.book.v(v), F.chapter.v(v), F.verse.v(v),\n", " ))\n", " for v in sorted(diff_pairs[lex][i]['hasnot']):\n", " outf.write('{}\\t{}\\t{}\\t-\\t{}\\t{}\\t{}\\n'.format(\n", " lex, i, lnp, F.book.v(v), F.chapter.v(v), F.verse.v(v),\n", " ))\n", "outf.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a tab-separated file with the fields:\n", "\n", " lexeme\n", " group number\n", " group size (number of passages in the group)\n", " has?\n", " book\n", " chapter\n", " verse\n", " \n", "We list the lexemes that occur in at least one group of parallel passages but not in all members of that group.\n", "For every such lexeme, for every such group, and for every member of that group we make an entry.\n", "The entry has + in the field *has?* if the lexeme occurs in that passage, else it has -.\n", "\n", "See below for the lines corresponding to the first 4 lexemes." ] }, { "cell_type": "raw", "metadata": {}, "source": [ "lexeme\tgroup\tsize\thas?\tbook\tchapter\tverse\n", "