{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71\" target=\"_blank\"><img align=\"left\"src=\"images/etcbc4easy-small.png\"/></a>\n",
    "<a href=\"http://laf-fabric.readthedocs.org/en/latest/\" target=\"_blank\"><img align=\"left\" src=\"images/laf-fabric-xsmall.png\"/></a>\n",
    "<a href=\"http://www.godgeleerdheid.vu.nl/etcbc\" target=\"_blank\"><img align=\"right\" src=\"images/VU-ETCBC-xsmall.png\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Rare lexemes in parallels\n",
    "\n",
    "Joint work of Kenneth Bergland, Martijn Naaijer and Dirk Roorda.\n",
    "\n",
    "We try to find rare combinations of words and see how they occur in Thora books and in the Prophets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0.00s This is LAF-Fabric 4.5.5\n",
      "API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n",
      "Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import sys,os\n",
    "import collections, difflib\n",
    "from IPython.display import HTML, display_pretty, display_html\n",
    "from difflib import SequenceMatcher\n",
    "\n",
    "import laf\n",
    "from laf.fabric import LafFabric\n",
    "from etcbc.preprocess import prepare\n",
    "fabric = LafFabric()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "source = 'etcbc'\n",
    "version = '4b'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0.00s LOADING API: please wait ... \n",
      "  0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56\n",
      "  0.00s INFO: USING DATA COMPILED AT: 2015-11-03T06-44-21\n",
      "  4.17s LOGFILE=/Users/dirk/SURFdrive/laf-fabric-output/etcbc4b/rare_lexemes/__log__rare_lexemes.txt\n",
      "    13s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html\n",
      "  0.00s LOADING API with EXTRAs: please wait ... \n",
      "  0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56\n",
      "  0.00s INFO: USING DATA COMPILED AT: 2015-11-03T06-44-21\n",
      "  0.00s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK rare_lexemes AT 2015-12-17T10-55-18\n",
      "  0.00s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK rare_lexemes AT 2015-12-17T10-55-18\n"
     ]
    }
   ],
   "source": [
    "API = fabric.load(source+version, 'lexicon', 'rare_lexemes', {\n",
    "    \"xmlids\": {\"node\": False, \"edge\": False},\n",
    "    \"features\": ('''\n",
    "        otype\n",
    "        language lex g_cons g_word_utf8 trailer_utf8 phono phono_sep\n",
    "        sp\n",
    "        book chapter verse label number\n",
    "    ''',''),\n",
    "    \"prepare\": prepare,\n",
    "    \"primary\": False,\n",
    "}, verbose='NORMAL')\n",
    "exec(fabric.localnames.format(var='fabric'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "book_classes = dict(\n",
    "    legal = set('''\n",
    "        Exodus Leviticus Deuteronomium\n",
    "    '''.strip().split()),\n",
    "    prophets = set('''\n",
    "        Jesaia Jeremia\n",
    "        Ezechiel Hosea\n",
    "        Joel Amos Obadia Jona Micha Nahum Habakuk Zephania Haggai Sacharia Maleachi\n",
    "    '''.strip().split())\n",
    ")\n",
    "book_classes_index = {}\n",
    "for (cl, books) in book_classes.items():\n",
    "    for book in books:\n",
    "        book_classes_index[book] = cl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Leave out non-content words\n",
    "\n",
    "It is a bit sorry to have all those bigrams involving an article, for example.\n",
    "So, as an option, we leave out the non content words, being words with a certain part-of-speech.\n",
    "\n",
    "    X art\tarticle\n",
    "      verb\tverb\n",
    "      subs\tnoun\n",
    "      nmpr\tproper noun\n",
    "      advb\tadverb\n",
    "    X prep\tpreposition\n",
    "    X conj\tconjunction\n",
    "    X prps\tpersonal pronoun\n",
    "    X prde\tdemonstrative pronoun\n",
    "    X prin\tinterrogative pronoun\n",
    "    X intj\tinterjection\n",
    "    X nega\tnegative particle\n",
    "    X inrg\tinterrogative particle\n",
    "      adjv\tadjective"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ONLY_SP = set('''\n",
    "    verb subs nmpr advb adjv\n",
    "'''.strip().split())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Gather bi- and tri-grams\n",
    "\n",
    "We create a big set of bi-tri-grams and count their frequencies.\n",
    "We only consider bi-tri-grams that of which all members are part of the same clause.\n",
    "\n",
    "We want to know the frequencies in the thora, in the prophets, and in the whole bible.\n",
    "\n",
    "## Get all the grams"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "    11s Collecting all <= 4-grams\n",
      "    16s Done\n"
     ]
    }
   ],
   "source": [
    "grams = collections.defaultdict(lambda: collections.defaultdict(lambda: collections.Counter()))\n",
    "ng = 4 # the max number of members in the n-gram\n",
    "msg('Collecting all <= {}-grams'.format(ng))\n",
    "for c in F.otype.s('clause'):\n",
    "    bk = F.book.v(L.u('book', c))\n",
    "    words = list(L.d('word', c))\n",
    "    lenw = len(words)\n",
    "    for n in range(1, ng + 1):\n",
    "        for i in range(lenw - n):\n",
    "            this_gram = '-'.join(F.lex.v(w) if F.sp.v(w) in ONLY_SP else '*' for w in (words[j] for j in range(i, n)))\n",
    "            grams[n][bk][this_gram] += 1\n",
    "msg('Done')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Count and rank the grams\n",
    "\n",
    "Now compute the counts for the indicated groups of books."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "    18s Counting grams\n",
      "    19s Done\n"
     ]
    }
   ],
   "source": [
    "freqs_by_g = collections.defaultdict(lambda: collections.defaultdict(lambda: collections.Counter()))\n",
    "freqs_by_c = collections.defaultdict(lambda: collections.defaultdict(lambda: collections.Counter()))\n",
    "\n",
    "msg('Counting grams')\n",
    "for n in grams:\n",
    "    ngrams = grams[n]\n",
    "    for bk in ngrams:\n",
    "        cl = book_classes_index.get(bk, 'bible')\n",
    "        these_grams = ngrams[bk]\n",
    "        for g in these_grams:\n",
    "            freqs_by_g[n][g][cl] += these_grams[g]\n",
    "            freqs_by_c[n][cl][g] += these_grams[g]\n",
    "msg('Done')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Filter the grams\n",
    "\n",
    "We set a threshold below which grams are considered *rare* and another threshold above which grams are considered abundant.\n",
    "\n",
    "Then we list all grams that are abundant in one collection of books but rare in the other one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "    21s Filtering\n",
      "    24s Done\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RARE means <= 1\n",
      "\t1-grams: rr ra ar aa =  1866   423    88   141\n",
      "\t2-grams: rr ra ar aa = 10124  1276   430   700\n",
      "\t3-grams: rr ra ar aa = 29657  1933  1116   723\n",
      "\t4-grams: rr ra ar aa = 42812  1541  1393   458\n",
      "RARE means <= 2\n",
      "\t1-grams: rr ra ar aa =  2120   262    43    93\n",
      "\t2-grams: rr ra ar aa = 10997   808   241   484\n",
      "\t3-grams: rr ra ar aa = 31502   927   535   465\n",
      "\t4-grams: rr ra ar aa = 44768   603   583   250\n",
      "RARE means <= 3\n",
      "\t1-grams: rr ra ar aa =  2232   181    28    77\n",
      "\t2-grams: rr ra ar aa = 11403   566   178   383\n",
      "\t3-grams: rr ra ar aa = 32193   555   346   335\n",
      "\t4-grams: rr ra ar aa = 45384   309   351   160\n",
      "RARE means <= 4\n",
      "\t1-grams: rr ra ar aa =  2307   124    24    63\n",
      "\t2-grams: rr ra ar aa = 11643   433   135   319\n",
      "\t3-grams: rr ra ar aa = 32525   383   262   259\n",
      "\t4-grams: rr ra ar aa = 45616   215   246   127\n",
      "RARE means <= 5\n",
      "\t1-grams: rr ra ar aa =  2352    96    18    52\n",
      "\t2-grams: rr ra ar aa = 11803   334   115   278\n",
      "\t3-grams: rr ra ar aa = 32721   295   203   210\n",
      "\t4-grams: rr ra ar aa = 45781   153   173    97\n",
      "RARE means <= 6\n",
      "\t1-grams: rr ra ar aa =  2375    81    20    42\n",
      "\t2-grams: rr ra ar aa = 11903   283    97   247\n",
      "\t3-grams: rr ra ar aa = 32855   245   152   177\n",
      "\t4-grams: rr ra ar aa = 45879   123   116    86\n",
      "RARE means <= 7\n",
      "\t1-grams: rr ra ar aa =  2386    77    21    34\n",
      "\t2-grams: rr ra ar aa = 11974   261    82   213\n",
      "\t3-grams: rr ra ar aa = 32950   203   121   155\n",
      "\t4-grams: rr ra ar aa = 45929   101   101    73\n",
      "RARE means <= 8\n",
      "\t1-grams: rr ra ar aa =  2407    64    19    28\n",
      "\t2-grams: rr ra ar aa = 12029   230    82   189\n",
      "\t3-grams: rr ra ar aa = 33018   168   107   136\n",
      "\t4-grams: rr ra ar aa = 45980    76    82    66\n",
      "RARE means <= 9\n",
      "\t1-grams: rr ra ar aa =  2423    52    17    26\n",
      "\t2-grams: rr ra ar aa = 12076   209    76   169\n",
      "\t3-grams: rr ra ar aa = 33067   147    92   123\n",
      "\t4-grams: rr ra ar aa = 46016    58    67    63\n"
     ]
    }
   ],
   "source": [
    "RARE = 1\n",
    "\n",
    "class_a = 'legal'\n",
    "class_b = 'prophets'\n",
    "\n",
    "results = collections.defaultdict(lambda: collections.defaultdict(lambda: collections.Counter()))\n",
    "\n",
    "msg('Filtering')\n",
    "for RARE in range(1, 10):\n",
    "    for n in sorted(freqs_by_g):\n",
    "        nfreqs_by_g = freqs_by_g[n]\n",
    "        for g in nfreqs_by_g:\n",
    "            g_info = nfreqs_by_g[g]\n",
    "            nc = dict((cl, g_info.get(cl, 0)) for cl in book_classes)\n",
    "            if nc[class_a] <= RARE and nc[class_b] <= RARE: results[RARE][n]['rr'] += 1 \n",
    "            elif nc[class_a] <= RARE and nc[class_b] > RARE: results[RARE][n]['ra'] += 1\n",
    "            elif nc[class_a] > RARE and nc[class_b] <= RARE: results[RARE][n]['ar'] += 1\n",
    "            elif nc[class_a] > RARE and nc[class_b] > RARE: results[RARE][n]['aa'] += 1\n",
    "            else: results[RARE][n]['xx'] += 1\n",
    "msg('Done')\n",
    "\n",
    "for RARE in sorted(results):\n",
    "    print('RARE means <= {}'.format(RARE))\n",
    "    for n in range(1, ng + 1):\n",
    "        print('\\t{}-grams: rr ra ar aa = {}'.format(\n",
    "            n, \n",
    "            ' '.join(\n",
    "                    '{:>5}'.format(results[RARE].get(n, {}).get(y, 0)) for y in ('rr', 'ra', 'ar', 'aa')\n",
    "                ),\n",
    "            ))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Verse index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "21m 38s Making a mapping between a passage specification and a verse node\n",
      "21m 39s 23213 verses\n"
     ]
    }
   ],
   "source": [
    "msg(\"Making a mapping between a passage specification and a verse node\")\n",
    "passage2vnode = {}\n",
    "for vs in F.otype.s('verse'):\n",
    "    lab = (F.book.v(vs), F.chapter.v(vs), F.verse.v(vs))\n",
    "    passage2vnode[lab] = vs\n",
    "msg(\"{} verses\".format(len(passage2vnode)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Ways forward\n",
    "\n",
    "1. restrict to indicated chapters in each of the books (that correspond to the agreed genre of *legal inside thora*)\n",
    "2. compute unique parallels between different collections in order to get a baseline of how many to expect.\n",
    "3. alternatively, use parallel passages to find parallel phrase and then focus on things that are rare.\n",
    "\n",
    "# Other indicators\n",
    "\n",
    "Look for *update lexemes*, where in parallel phrases one lexeme is different, for example measure nouns.\n",
    "See for example Ezek 45:10 - Lev 19:36 - Deut 25:15.\n",
    "\n",
    "This is a case of parallel language use, not strictly a *parallel passage*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Difference lexemes\n",
    "\n",
    "We are going to list the lexemes that make a difference between parallel passages.\n",
    "For every set of parallel passages we list the lexemes that do not occur in their intersection.\n",
    "\n",
    "We print a list of all groups of parallel verses, where the lexemes that do not occur in all members of the group are highlighted.\n",
    "\n",
    "Every lexeme that occurs in some parallel passage, but not in the intersection of the lexemes of its parallel passages will be listed in a dictionary, keyed by that lexeme, and then by the group number, and then it has the following information: passages in which it does occur, and passages in which it does not occur."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "21m 43s /Users/dirk/SURFdrive/current/demos/github/shebanq/static/docs/tools/parallel/files/crossrefdb.csv\n"
     ]
    }
   ],
   "source": [
    "CROSSREF_TOOL = 'parallel'\n",
    "CROSSREF_DB_FILE = 'crossrefdb.csv'\n",
    "SHEBANQ_PATH = os.path.abspath('{}/../../../shebanq'.format(os.getcwd))\n",
    "CROSSREF_DB_DIR = '{}/static/docs/tools/{}/files'.format(SHEBANQ_PATH, CROSSREF_TOOL)\n",
    "CROSSREF_DB_PATH = '{}/{}'.format(CROSSREF_DB_DIR, CROSSREF_DB_FILE)\n",
    "msg(CROSSREF_DB_PATH)\n",
    "PRETTY_PAIRS = 'pairs.html'\n",
    "DIFF_LEX = 'difflex.csv'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<a target=\"_blank\" href=\"https://rawgit.com/etcbc/laf-fabric-nbs/master/lingvar/pairs.html\">parallel_pairs</a>\n",
       "<a target=\"_blank\" href=\"https://rawgit.com/etcbc/laf-fabric-nbs/master/lingvar/difflex.csv\">difference lexemes</a>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "loc_tpl = 'https://rawgit.com/etcbc/laf-fabric-nbs/master/lingvar/{}'\n",
    "HTML('''\n",
    "<a target=\"_blank\" href=\"{}\">parallel_pairs</a>\n",
    "<a target=\"_blank\" href=\"{}\">difference lexemes</a>\n",
    "'''.format(loc_tpl.format(PRETTY_PAIRS), loc_tpl.format(DIFF_LEX)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "21m 52s Reading crossrefs database\n",
      "21m 52s 14354 entries read\n",
      "21m 52s Gathered 1235 sets of parallel verses\n"
     ]
    }
   ],
   "source": [
    "msg('Reading crossrefs database')\n",
    "n = 0\n",
    "parallel_pairs_proto = collections.defaultdict(lambda: set())\n",
    "group_index = {}\n",
    "group_number = 0\n",
    "with open(CROSSREF_DB_PATH) as f:\n",
    "    for line in f:\n",
    "        n += 1\n",
    "        if n == 1: continue\n",
    "        (bkx, chx, vsx, bky, chy, vsy, rd) = line.rstrip('\\n').split('\\t')\n",
    "        vx = passage2vnode[(bkx, chx, vsx)]\n",
    "        vy = passage2vnode[(bky, chy, vsy)]\n",
    "        gx = group_index.get(vx, None)\n",
    "        gy = group_index.get(vy, None)\n",
    "        if gx == None and gy == None:\n",
    "            group_number += 1\n",
    "            group_index[vx] = group_number\n",
    "            group_index[vy] = group_number\n",
    "        elif gx == None:\n",
    "            group_index[vx] = gy\n",
    "        elif gy == None:\n",
    "            group_index[vy] = gx\n",
    "        elif gx != gy:\n",
    "            update = [x for x in group_index if group_index[x] == gx]\n",
    "            for x in update: group_index[x] = gy\n",
    "msg('{} entries read'.format(n))\n",
    "\n",
    "for (x, n) in group_index.items(): parallel_pairs_proto[n].add(x)\n",
    "parallel_pairs = [sorted(parallel_pairs_proto[n]) for n in parallel_pairs_proto]\n",
    "parallel_pairs_proto = None\n",
    "msg('Gathered {} sets of parallel verses'.format(len(parallel_pairs)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "21m 59s Building a difference lexeme table\n",
      "22m 00s 1235 groups in table\n"
     ]
    }
   ],
   "source": [
    "msg('Building a difference lexeme table')\n",
    "intersections = []\n",
    "unions = []\n",
    "for p in parallel_pairs:\n",
    "    clex = None\n",
    "    ulex = collections.defaultdict(lambda: set())\n",
    "    for x in p:\n",
    "        xlex = {F.lex.v(w) for w in L.d('word', x)}\n",
    "        clex = xlex if clex == None else clex & xlex\n",
    "        for lex in xlex:\n",
    "            ulex[lex].add(x)\n",
    "    intersections.append(clex)\n",
    "    unions.append(ulex)\n",
    "msg('{} groups in table'.format(len(intersections)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "css = '''\n",
    "<style type=\"text/css\">\n",
    "table.t {\n",
    "    width: 100%;\n",
    "    border-collapse: collapse;\n",
    "    direction: rtl;\n",
    "    border: 2px solid #aaaaaa;\n",
    "}\n",
    "td.t {\n",
    "    border: 2px solid #aaaaaa;\n",
    "    font-family: Ezra SIL, SBL Hebrew, Verdana, sans-serif;\n",
    "    font-size: x-large;\n",
    "    line-height: 1.7;\n",
    "    text-align: right;\n",
    "    direction: rtl;\n",
    "}\n",
    "td.vl {\n",
    "    font-family: Verdana, Arial, sans-serif;\n",
    "    font-size: small;\n",
    "    text-align: right;\n",
    "    color: #aaaaaa;\n",
    "    width: 10%;\n",
    "    direction: ltr;\n",
    "    border-left: 2px solid #aaaaaa;\n",
    "    border-right: 2px solid #aaaaaa;\n",
    "}\n",
    "span.m {\n",
    "    background-color: #aaaaff;\n",
    "}\n",
    "span.f {\n",
    "    background-color: #ffaaaa;\n",
    "}\n",
    "span.x {\n",
    "    background-color: #ffffaa;\n",
    "    color: #bb0000;\n",
    "}\n",
    "span.delete {\n",
    "    background-color: #ffaaaa;\n",
    "}\n",
    "span.insert {\n",
    "    background-color: #aaffaa;\n",
    "}\n",
    "span.replace {\n",
    "    background-color: #ffff00;\n",
    "}\n",
    "</style>\n",
    "'''\n",
    "\n",
    "html_file_tpl = '''<html>\n",
    "<head>\n",
    "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\n",
    "<title>{}</title>\n",
    "{}\n",
    "</head>\n",
    "<body>\n",
    "<h1>Table of groups</h1>\n",
    "{}\n",
    "<h1>Groups</h1>\n",
    "{}\n",
    "</body>\n",
    "</html>'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# we want to sort passages in such a way that the verses in 1 Kings 19-26 are put before anything else\n",
    "\n",
    "def print_chunk(i, clex, members):\n",
    "    result = ['''\n",
    "<a name=\"c_{i}\">Group {i}</a>\n",
    "<table class=\"t\">\n",
    "'''.format(i=i)]\n",
    "    for v in members:\n",
    "        lab = '{} {}:{}'.format(F.book.v(v), F.chapter.v(v), F.verse.v(v))\n",
    "        text = ''.join(\n",
    "            (F.g_word_utf8.v(w)\\\n",
    "                if F.lex.v(w) in clex \\\n",
    "                else '<span class=\"replace\">{}</span>'.format(F.g_word_utf8.v(w))\n",
    "            )+F.trailer_utf8.v(w) for w in L.d('word', v)\n",
    "        )\n",
    "        result.append('''\n",
    "<tr class=\"t\"><td class=\"vl\">{rb}</td><td class=\"t\">{rl}</td></tr>\n",
    "'''.format(\n",
    "            rb=lab,\n",
    "            rl=text,\n",
    "        ))\n",
    "    result.append('''</table>\n",
    "''')\n",
    "    return ''.join(result)\n",
    "\n",
    "def index_chunk(i, clex, members):\n",
    "    verse_labels = ', '.join('{} {}:{}'.format(F.book.v(v), F.chapter.v(v), F.verse.v(v)) for v in members)\n",
    "    return '<p><b>{i}</b> <a href=\"#c_{i}\">{vl}</a></p>\\n'.format(\n",
    "        vl = verse_labels, i=i,\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22m 21s Pretty printing the table\n",
      "22m 21s Done\n"
     ]
    }
   ],
   "source": [
    "msg('Pretty printing the table')\n",
    "allgeni_html = []\n",
    "allgenh_html = []\n",
    "for (i, p) in enumerate(parallel_pairs):\n",
    "    clex = intersections[i]\n",
    "    allgeni_html.append(index_chunk(i, clex, p))\n",
    "    allgenh_html.append(print_chunk(i, clex, p))\n",
    "outf = open('pairs.html', 'w')\n",
    "outf.write(html_file_tpl.format(\n",
    "    'Pairs',\n",
    "    css,\n",
    "    ''.join(allgeni_html),\n",
    "    ''.join(allgenh_html),\n",
    "))\n",
    "outf.close()\n",
    "msg('Done')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The table with difference lexemes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "diff_pairs = collections.defaultdict(lambda: collections.defaultdict(lambda: {}))\n",
    "for (i, p) in enumerate(parallel_pairs):\n",
    "    clex = intersections[i]\n",
    "    ulex = unions[i]\n",
    "    for lex in ulex:\n",
    "        if lex in clex: continue\n",
    "        diff_pairs[lex][i]['has'] = ulex[lex]\n",
    "        diff_pairs[lex][i]['hasnot'] = set(p) - ulex[lex]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "outf = open('difflex.csv', 'w')\n",
    "outf.write('lexeme\\tgroup\\tsize\\thas?\\tbook\\tchapter\\tverse\\n')\n",
    "for lex in sorted(diff_pairs):\n",
    "    for i in sorted(diff_pairs[lex]):\n",
    "        lnp = len(diff_pairs[lex][i]['has']) + len(diff_pairs[lex][i]['hasnot'])\n",
    "        for v in sorted(diff_pairs[lex][i]['has']):\n",
    "            outf.write('{}\\t{}\\t{}\\t+\\t{}\\t{}\\t{}\\n'.format(\n",
    "                lex, i, lnp, F.book.v(v), F.chapter.v(v), F.verse.v(v), \n",
    "            ))\n",
    "        for v in sorted(diff_pairs[lex][i]['hasnot']):\n",
    "            outf.write('{}\\t{}\\t{}\\t-\\t{}\\t{}\\t{}\\n'.format(\n",
    "                lex, i, lnp, F.book.v(v), F.chapter.v(v), F.verse.v(v), \n",
    "            ))\n",
    "outf.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is a tab-separated file with fields:\n",
    "\n",
    "    lexeme\n",
    "    group number\n",
    "    has?\n",
    "    book\n",
    "    chapter\n",
    "    verse\n",
    "    \n",
    "We list the lexemes that occur in at least one group of parallel passages where the lexeme in question does not occur in all members of the group.\n",
    "For all such lexemes and for all such groups and for all members of such groups we make an entry.\n",
    "The entry has + in field *has?* if the lexeme occurs in that passage, else it has -.\n",
    "\n",
    "See below for the lines corresponding to the first 4 lexemes."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "lexeme\tgroup\tbook\tchapter\tverse\n",
    "<BD/\t42\t+\tReges_II\t21\t10\n",
    "<BD/\t42\t-\tEzechiel\t24\t1\n",
    "<BD/\t42\t-\tEzechiel\t26\t1\n",
    "<BD/\t42\t-\tEzechiel\t29\t1\n",
    "<BD/\t42\t-\tEzechiel\t29\t17\n",
    "<BD/\t42\t-\tEzechiel\t30\t20\n",
    "<BD/\t42\t-\tEzechiel\t31\t1\n",
    "<BD/\t42\t-\tEzechiel\t32\t1\n",
    "<BD/\t42\t-\tEzechiel\t32\t17\n",
    "<BD/\t42\t-\tHaggai\t1\t3\n",
    "<BD/\t42\t-\tHaggai\t2\t1\n",
    "<BD/\t83\t+\tReges_II\t25\t8\n",
    "<BD/\t83\t-\tJeremia\t52\t12\n",
    "<BD/\t176\t+\tJesaia\t37\t24\n",
    "<BD/\t176\t-\tReges_II\t19\t23\n",
    "<BD/\t237\t+\tReges_II\t25\t24\n",
    "<BD/\t237\t-\tJeremia\t40\t9\n",
    "<BD/\t456\t+\tLeviticus\t25\t55\n",
    "<BD/\t456\t-\tExodus\t29\t46\n",
    "<BD/\t456\t-\tLeviticus\t11\t45\n",
    "<BD/\t456\t-\tLeviticus\t19\t36\n",
    "<BD/\t456\t-\tLeviticus\t22\t33\n",
    "<BD/\t456\t-\tLeviticus\t25\t38\n",
    "<BD/\t456\t-\tNumeri\t15\t41\n",
    "<BD/\t516\t+\tPsalmi\t18\t1\n",
    "<BD/\t516\t-\tSamuel_II\t22\t1\n",
    "<BD/\t559\t+\tJosua\t1\t15\n",
    "<BD/\t559\t-\tDeuteronomium\t3\t20\n",
    "<BD/\t611\t+\tChronica_II\t9\t21\n",
    "<BD/\t611\t-\tReges_I\t10\t22\n",
    "<BD/\t715\t+\tReges_II\t8\t19\n",
    "<BD/\t715\t-\tChronica_II\t21\t7\n",
    "<BD/\t717\t+\tJeremia\t37\t2\n",
    "<BD/\t717\t-\tJeremia\t50\t1\n",
    "<BD/\t765\t+\tSamuel_I\t21\t12\n",
    "<BD/\t765\t-\tSamuel_I\t29\t5\n",
    "<BD/\t1133\t+\tGenesis\t42\t13\n",
    "<BD/\t1133\t-\tGenesis\t42\t32\n",
    "<BD/\t1158\t+\tJeremia\t46\t28\n",
    "<BD/\t1158\t-\tJeremia\t30\t11\n",
    "<BDH/\t35\t+\tExodus\t39\t32\n",
    "<BDH/\t35\t+\tExodus\t39\t42\n",
    "<BDH/\t35\t-\tExodus\t7\t6\n",
    "<BDH/\t35\t-\tExodus\t12\t28\n",
    "<BDH/\t35\t-\tExodus\t12\t50\n",
    "<BDH/\t35\t-\tExodus\t40\t16\n",
    "<BDH/\t35\t-\tNumeri\t1\t54\n",
    "<BDH/\t35\t-\tNumeri\t8\t20\n",
    "<BDH/\t35\t-\tNumeri\t17\t26\n",
    "<BDH/\t35\t-\tNumeri\t30\t1\n",
    "<BDH/\t35\t-\tNumeri\t36\t10\n",
    "<BDH/\t35\t-\tJosua\t14\t5\n",
    "<BDH/\t143\t+\tNumeri\t4\t23\n",
    "<BDH/\t143\t+\tNumeri\t4\t30\n",
    "<BDH/\t143\t+\tNumeri\t4\t35\n",
    "<BDH/\t143\t+\tNumeri\t4\t39\n",
    "<BDH/\t143\t+\tNumeri\t4\t43\n",
    "<BDH/\t143\t+\tNumeri\t4\t47\n",
    "<BDH/\t143\t+\tNumeri\t8\t24\n",
    "<BDH/\t143\t-\tNumeri\t4\t3\n",
    "<BDH/\t985\t+\tExodus\t27\t19\n",
    "<BDH/\t985\t-\tExodus\t38\t20\n",
    "<BDJH/\t1001\t+\tObadia\t1\t1\n",
    "<BDJH/\t1001\t-\tJeremia\t49\t14\n",
    "<BDWN/\t913\t+\tChronica_II\t34\t20\n",
    "<BDWN/\t913\t-\tReges_II\t22\t12"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}