{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# BHSA and OpenScriptures bridge\n", "\n", "Both the BHSA and the OpenScriptures represent efforts to add linguistic markup to the Hebrew Bible.\n", "\n", "The BHSA is the product of years of encoding work by researchers, in a strongly algorithmic fashion, although\n", "not without human decisions at the micro level.\n", "\n", "OpenScriptures represents a crowd-sourced approach.\n", "\n", "Regardless of theoretical considerations on the validity of these approaches, it is worthwhile to be able to compare them.\n", "Moreover, for some research problems, it might be helpful to use both encodings in one toolkit.\n", "\n", "In this repo we develop a way of doing exactly this.\n", "\n", "We make a link between the morphology in the\n", "[Openscriptures](http://openscriptures.org)\n", "and the linguistics in the [BHSA](https://github.com/ETCBC/bhsa).\n", "\n", "We proceed as follows:\n", "\n", "* extract the morphology from the files in\n", " [openscriptures/morphhb/wlc](https://github.com/openscriptures/morphhb/tree/master/wlc)\n", "* link the words in the OpenScriptures files to slots in the BHSA\n", "* compile the OpenScriptures morphology data into a TF feature file.\n", "\n", "With this in hand, we have the OpenScriptures morphology in Text-Fabric, aligned to the BHSA.\n", "That opens the way for further comparisons, which take the actual morphology into account." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## History\n", "\n", "When we first made the comparison, in 2017, only 88% of the OpenScriptures Morphology was fixed.\n", "\n", "In 2021 we have pulled the same repository of Open Scriptures again, and used a new version of the BHSA as well.\n", "It turns out the 100% of the words have been morphologically annotated by OpenScriptures now." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Application\n", "\n", "This notebook sets the stage for focused comparisons between the BHSA features on words and the OSM morphology.\n", "\n", "See\n", "\n", "* [category](category.ipynb)\n", "* [language](language.ipynb)\n", "* [part-of-speech](part-of-speech.ipynb)\n", "* [verb](verb.ipynb)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import collections\n", "import yaml\n", "from glob import glob\n", "from lxml import etree\n", "from itertools import zip_longest\n", "from functools import reduce\n", "from unicodedata import normalize, category\n", "\n", "from tf.fabric import Fabric\n", "from tf.core.helpers import rangesFromSet, formatMeta\n", "import utils\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pipeline\n", "See [operation](https://github.com/ETCBC/pipeline/blob/master/README.md#operation)\n", "for how to run this script in the pipeline." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "if \"SCRIPT\" not in locals():\n", " SCRIPT = False\n", " FORCE = True\n", " CORE_NAME = \"bhsa\"\n", " NAME = \"bridging\"\n", " VERSION = \"2021\"\n", "\n", "\n", "def stop(good=False):\n", " if SCRIPT:\n", " sys.exit(0 if good else 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook can run a lot of tests and create a lot of examples.\n", "However, when run in the pipeline, we only want to create the two `osm` features.\n", "\n", "So, further on, there will be quite a bit of code under the condition `not SCRIPT`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Setting up the context: source file and target directories\n", "\n", "The conversion is executed in an environment of directories, so that sources, temp files and\n", "results are in convenient places and do not have to be shifted around." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "repoBase = os.path.expanduser(\"~/github/etcbc\")\n", "coreRepo = \"{}/{}\".format(repoBase, CORE_NAME)\n", "thisRepo = \"{}/{}\".format(repoBase, NAME)\n", "\n", "coreTf = \"{}/tf/{}\".format(coreRepo, VERSION)\n", "\n", "thisTemp = \"{}/_temp/{}\".format(thisRepo, VERSION)\n", "thisTempTf = \"{}/tf\".format(thisTemp)\n", "\n", "thisTf = \"{}/tf/{}\".format(thisRepo, VERSION)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Test\n", "\n", "Check whether this conversion is needed in the first place.\n", "Only when run as a script." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [], "source": [ "if SCRIPT:\n", " (good, work) = utils.mustRun(\n", " None, \"{}/.tf/{}.tfx\".format(thisTf, \"osm\"), force=FORCE\n", " )\n", " if not good:\n", " stop(good=False)\n", " if not work:\n", " stop(good=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load the BHSA data" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 
0.00s Load the existing TF dataset .\n", "..............................................................................................\n", "This is Text-Fabric 9.1.7\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "114 features found and 0 ignored\n", " 0.00s loading features ...\n", " | 0.00s Dataset without structure sections in otext:no structure functions in the T-API\n", " 11s All features loaded/computed - for details use TF.isLoaded()\n" ] }, { "data": { "text/plain": [ "[('Computed',\n", " 'computed-data',\n", " ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),\n", " ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),\n", " ('Fabric', 'loading', ('TF',)),\n", " ('Locality', 'locality', ('L Locality',)),\n", " ('Nodes', 'navigating-nodes', ('N Nodes',)),\n", " ('Features',\n", " 'node-features',\n", " ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),\n", " ('Search', 'search', ('S Search',)),\n", " ('Text', 'text', ('T Text',))]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "utils.caption(4, \"Load the existing TF dataset\")\n", "TF = Fabric(locations=coreTf, modules=[\"\"])\n", "\n", "api = TF.load(\n", " \"\"\"\n", " book\n", " g_cons_utf8 g_word_utf8\n", "\"\"\"\n", ")\n", "api.makeAvailableIn(globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading Open Scriptures Morphology" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "NB_DIR = os.getcwd()\n", "OS_BASE = os.path.expanduser(\"~/github/openscriptures/morphhb/wlc\")\n", "os.chdir(OS_BASE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mapping the book names\n", "OSM uses abbreviated book names.\n", "We map them onto the (latin) book names of the BHSA.\n", "\n", "Here is a list of the BHSA books." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 11s Genesis Exodus Leviticus Numeri Deuteronomium Josua Judices Samuel_I Samuel_II Reges_I Reges_II Jesaia Jeremia Ezechiel Hosea Joel Amos Obadia Jona Micha Nahum Habakuk Zephania Haggai Sacharia Maleachi Psalmi Iob Proverbia Ruth Canticum Ecclesiastes Threni Esther Daniel Esra Nehemia Chronica_I Chronica_II\n" ] } ], "source": [ "bhsBooks = [F.book.v(n) for n in F.otype.s(\"book\")]\n", "utils.caption(0, \" \".join(bhsBooks))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell can be used to retrieve the OSM book names,\n", "from which the ordered list `osmBooks` below can be composed manually." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 11s 1Chr 1Kgs 1Sam 2Chr 2Kgs 2Sam Amos Dan Deut Eccl Esth Exod Ezek Ezra Gen Hab Hag Hos Isa Jer Job Joel Jonah Josh Judg Lam Lev Mal Mic Nah Neh Num Obad Prov Ps Ruth Song Zech Zeph\n" ] } ], "source": [ "osmBookSet = set(fn[0:-4] for fn in glob(\"*.xml\") if fn != \"VerseMap.xml\")\n", "utils.caption(0, \" \".join(sorted(osmBookSet)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We list the books in the \"canonical\" order (as given in the BHSA)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "osmBooks = \"\"\"\n", "Gen Exod Lev Num Deut\n", "Josh Judg 1Sam 2Sam 1Kgs 2Kgs\n", "Isa Jer Ezek Hos Joel Amos Obad\n", "Jonah Mic Nah Hab Zeph Hag Zech Mal\n", "Ps Job Prov Ruth Song Eccl Lam Esth\n", "Dan Ezra Neh 1Chr 2Chr\n", "\"\"\".strip().split()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We check whether we did not overlook books or missed changes in the OSM abbreviations of the books." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "osmBookSet == set(osmBooks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can construct the mapping, both ways." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "osmBookFromBhs = {}\n", "bhsBookFromOsm = {}\n", "for (i, bhsBook) in enumerate(bhsBooks):\n", " osmBook = osmBooks[i]\n", " osmBookFromBhs[bhsBook] = osmBook\n", " bhsBookFromOsm[osmBook] = bhsBook" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Consonantal matters\n", "\n", "For alignment purposes, we reduce all textual material to its consonantal representation.\n", "Sometimes we need to blur the distinction between final consonants and their normal counterparts.\n", "\n", "In order to strip consonants from all their diacritical marks, we use unicode denormalization and\n", "unicode character categories." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "lines_to_end_of_cell_marker": 2 }, "outputs": [], "source": [ "NS = \"{http://www.bibletechnologies.net/2003/OSIS/namespace}\"\n", "NFD = \"NFD\"\n", "LO = \"Lo\"\n", "\n", "finals = {\n", " \"ך\": \"כ\",\n", " \"ם\": \"מ\",\n", " \"ן\": \"נ\",\n", " \"ף\": \"פ\",\n", " \"ץ\": \"צ\",\n", "}\n", "\n", "finalsI = {v: k for (k, v) in finals.items()}" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "`toCons(s)`: strip all pointing (accents, vowels, dagesh, shin/sin dot) from all characters in string `s`.\n", "\n", "`final(c)`: replace consonant `c` by its final counterpart, if there is one, otherwise return `c`.\n", "\n", "`finalCons(s)`: replace the last character of `s` by its final counterpart.\n", "\n", "`unFinal(s)`: replace all consonants in `s` by their non-final counterparts." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def toCons(s):\n", " return \"\".join(c for c in normalize(NFD, s) if category(c) == LO)\n", "\n", "\n", "def final(c):\n", " return finalsI.get(c, c)\n", "\n", "\n", "def finalCons(s):\n", " return s[0:-1] + final(s[-1])\n", "\n", "\n", "def unFinal(s):\n", " return \"\".join(finals.get(c, c) for c in s)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Read OSM books\n", "\n", "We are going to read the OSM files.\n", "They correspond to books.\n", "\n", "We drill down to verse nodes and pick up the `` elements.\n", "What we need from these elements is the full text content and the attributes\n", "`lemma` and `morph`.\n", "\n", "We ignore markup within the full text content of the `` elements.\n", "\n", "The material we extract, may contain `/`.\n", "We split the text content and the `lemma` and `morph` content on `/`, and recombine the resulting parts in\n", "OSM morpheme entries, having each a full-text bit, a morph bit and a lemma bit.\n", "\n", "Caveat: when splitting the morpheme string, we should first split off the first character, which indicates language,\n", "and then add it to all the parts!\n", "\n", "So, one `` element may give rise to several morpheme entries.\n", "\n", "The full text is fully pointed. We also compute a consonantal version of the full text and store it\n", "within the morpheme entries.\n", "\n", "We end up with a list, `osmMorphemes` of morpheme entries.\n", "\n", "In passing, we count the `` elements without `morph` attributes, and those without textual content.\n", "\n", "We also store the book, chapter, verse and sequence number of the `` element in each entry." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def readOsmBook(osmBook, osmMorphemes, stats):\n", " infile = \"{}.xml\".format(osmBook)\n", " parser = etree.XMLParser(remove_blank_text=True, ns_clean=True)\n", " root = etree.parse(infile, parser).getroot()\n", " osisTextNode = root[0]\n", " divNode = osisTextNode[1]\n", " chapterNodes = list(divNode)\n", " utils.caption(\n", " 0,\n", " \"reading {:<5} ({:<15}) {:>3} chapters\".format(\n", " osmBook, bhsBookFromOsm[osmBook], len(chapterNodes)\n", " ),\n", " )\n", " ch = 0\n", " for chapterNode in chapterNodes:\n", " if chapterNode.tag != NS + \"chapter\":\n", " continue\n", " ch += 1\n", " vs = 0\n", " for verseNode in list(chapterNode):\n", " if verseNode.tag != NS + \"verse\":\n", " continue\n", " vs += 1\n", " w = 0\n", " for wordNode in list(verseNode):\n", " if wordNode.tag != NS + \"w\":\n", " continue\n", " w += 1\n", " lemma = wordNode.get(\"lemma\", None)\n", " morph = wordNode.get(\"morph\", None)\n", " text = \"\".join(x for x in wordNode.itertext())\n", "\n", " lemmas = lemma.split(\"/\") if lemma is not None else []\n", " morphs = morph.split(\"/\") if morph is not None else []\n", " if len(morphs) > 1:\n", " lang = morphs[0][0]\n", " morphs = [morphs[0]] + [lang + m for m in morphs[1:]]\n", " texts = text.split(\"/\") if text is not None else []\n", " # zip_longest accomodates for unequal lengths of its operands\n", " # for missing values we fill in ''\n", " for (lm, mph, tx) in zip_longest(lemmas, morphs, texts, fillvalue=\"\"):\n", " txc = None if tx is None else toCons(tx)\n", " osmMorphemes.append((tx, txc, mph, lm, osmBook, ch, vs, w))\n", " if not mph:\n", " stats[\"noMorph\"] += 1\n", " if not tx:\n", " stats[\"noContent\"] += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That was the definition of the read function, now we are going to execute it." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 11s reading Gen (Genesis ) 50 chapters\n", "| 11s reading Exod (Exodus ) 40 chapters\n", "| 11s reading Lev (Leviticus ) 27 chapters\n", "| 11s reading Num (Numeri ) 36 chapters\n", "| 11s reading Deut (Deuteronomium ) 34 chapters\n", "| 12s reading Josh (Josua ) 24 chapters\n", "| 12s reading Judg (Judices ) 21 chapters\n", "| 12s reading 1Sam (Samuel_I ) 31 chapters\n", "| 12s reading 2Sam (Samuel_II ) 24 chapters\n", "| 12s reading 1Kgs (Reges_I ) 22 chapters\n", "| 12s reading 2Kgs (Reges_II ) 25 chapters\n", "| 12s reading Isa (Jesaia ) 66 chapters\n", "| 13s reading Jer (Jeremia ) 52 chapters\n", "| 13s reading Ezek (Ezechiel ) 48 chapters\n", "| 13s reading Hos (Hosea ) 14 chapters\n", "| 13s reading Joel (Joel ) 4 chapters\n", "| 13s reading Amos (Amos ) 9 chapters\n", "| 13s reading Obad (Obadia ) 1 chapters\n", "| 13s reading Jonah (Jona ) 4 chapters\n", "| 13s reading Mic (Micha ) 7 chapters\n", "| 13s reading Nah (Nahum ) 3 chapters\n", "| 13s reading Hab (Habakuk ) 3 chapters\n", "| 13s reading Zeph (Zephania ) 3 chapters\n", "| 13s reading Hag (Haggai ) 2 chapters\n", "| 13s reading Zech (Sacharia ) 14 chapters\n", "| 13s reading Mal (Maleachi ) 3 chapters\n", "| 13s reading Ps (Psalmi ) 150 chapters\n", "| 13s reading Job (Iob ) 42 chapters\n", "| 14s reading Prov (Proverbia ) 31 chapters\n", "| 14s reading Ruth (Ruth ) 4 chapters\n", "| 14s reading Song (Canticum ) 8 chapters\n", "| 14s reading Eccl (Ecclesiastes ) 12 chapters\n", "| 14s reading Lam (Threni ) 5 chapters\n", "| 14s reading Esth (Esther ) 10 chapters\n", "| 14s reading Dan (Daniel ) 12 chapters\n", "| 14s reading Ezra (Esra ) 10 chapters\n", "| 14s reading Neh (Nehemia ) 13 chapters\n", "| 14s reading 1Chr (Chronica_I ) 29 chapters\n", "| 14s reading 2Chr (Chronica_II ) 36 chapters\n", "| 14s \n", "BHS words: 426590\n", "OSM Morphemes: 469440\n", 
"No morphology: 1\n", "No content: 1\n", "100 % of the words are morphologically annotated.\n", "\n" ] } ], "source": [ "osmMorphemes = []\n", "stats = dict(noMorph=0, noContent=0)\n", "\n", "for bn in F.otype.s(\"book\"):\n", " bhsBook = T.sectionFromNode(bn, lang=\"la\")[0]\n", " osmBook = osmBookFromBhs[bhsBook]\n", " readOsmBook(osmBook, osmMorphemes, stats)\n", "\n", "utils.caption(\n", " 0,\n", " \"\"\"\n", "BHS words: {:>6}\n", "OSM Morphemes: {:>6}\n", "No morphology: {:>6}\n", "No content: {:>6}\n", "{} % of the words are morphologically annotated.\n", "\"\"\".format(\n", " F.otype.maxSlot,\n", " len(osmMorphemes),\n", " stats[\"noMorph\"],\n", " stats[\"noContent\"],\n", " round(\n", " 100\n", " * (len(osmMorphemes) - stats[\"noMorph\"] - stats[\"noContent\"])\n", " / len(osmMorphemes)\n", " ),\n", " ),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To give an impression of the contents of this list, we show the first few members.\n", "The column specification is:\n", "\n", " consonantal fully-pointed morph lemma book chapter verse `w`-number" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "data": { "text/plain": [ "[('בְּ', 'ב', 'HR', 'b', 'Gen', 1, 1, 1),\n", " ('רֵאשִׁ֖ית', 'ראשית', 'HNcfsa', '7225', 'Gen', 1, 1, 1),\n", " ('בָּרָ֣א', 'ברא', 'HVqp3ms', '1254 a', 'Gen', 1, 1, 2),\n", " ('אֱלֹהִ֑ים', 'אלהים', 'HNcmpa', '430', 'Gen', 1, 1, 3),\n", " ('אֵ֥ת', 'את', 'HTo', '853', 'Gen', 1, 1, 4),\n", " ('הַ', 'ה', 'HTd', 'd', 'Gen', 1, 1, 5),\n", " ('שָּׁמַ֖יִם', 'שמים', 'HNcmpa', '8064', 'Gen', 1, 1, 5),\n", " ('וְ', 'ו', 'HC', 'c', 'Gen', 1, 1, 6),\n", " ('אֵ֥ת', 'את', 'HTo', '853', 'Gen', 1, 1, 6),\n", " ('הָ', 'ה', 'HTd', 'd', 'Gen', 1, 1, 7),\n", " ('אָֽרֶץ', 'ארץ', 'HNcbsa', '776', 'Gen', 1, 1, 7),\n", " ('וְ', 'ו', 'HC', 'c', 'Gen', 1, 2, 1),\n", " ('הָ', 'ה', 'HTd', 'd', 'Gen', 1, 2, 1),\n", " ('אָ֗רֶץ', 'ארץ', 'HNcbsa', '776', 'Gen', 1, 2, 1),\n", " ('הָיְתָ֥ה', 
'היתה', 'HVqp3fs', '1961', 'Gen', 1, 2, 2)]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(osmMorphemes[0:15])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Alignment\n", "\n", "We now have to face the task to map the BHSA words to the OSM morphemes.\n", "\n", "We will encounter the challenge that at some spots the consonantal contents of the WLC (the source of the OSM)\n", "is different from that of the BHS, the source of the BHSA.\n", "\n", "Another challenge is that at some points the analysis behind the OSM differs from that of the BHSA in such a way\n", "that the BHSA has a word-split within an OSM morpheme.\n", "\n", "Yet another source of problems is that the BHSA inserts \"empty\" articles in places where the pointing in the\n", "surrounding material allows to conclude that an article is present, although it does not have a consonantal presence anymore." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "We need a function to quickly show what is going on in difficult spots.\n", "\n", "`showCase(w, j, ln)` shows the BHSA from word `w` onward, and the OSM from morpheme `j` onward.\n", "It lists `ln` positions in both sources." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def showCase(w, j, ln):\n", " print(T.sectionFromNode(w))\n", " print(\"BHS\")\n", " for n in range(w, w + ln):\n", " print(\"word {} = [{}]\".format(n, toCons(F.g_cons_utf8.v(n))))\n", " print(\"OSM\")\n", " for n in range(j, j + ln):\n", " print(\"morph {} = [{}]\".format(n, osmMorphemes[n][1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also define another function to easy inspect difficult spots.\n", "\n", "`BHSvsOSM(ws, js)` compares the BHSA words specified by list `ws` with the OSM morphemes\n", "specified by list `js`.\n", "\n", "Here we bump into the fact that the BHSA deals with whole words, and the OSM splits into morphemes.\n", "In this case, the pronominal suffix is treated as a separate morpheme." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def BHSvsOSM(ws, js):\n", " print(\n", " \"{}\\n{:<25}BHS {:<30} = {}\\n{:<25}OSM {:<30} = {}\".format(\n", " \"{} {}:{}\".format(*T.sectionFromNode(ws[0])),\n", " \" \",\n", " \", \".join(str(w) for w in ws),\n", " \"/\".join(F.g_word_utf8.v(w) for w in ws),\n", " \" \",\n", " \", \".join(\"w{}\".format(osmMorphemes[j][7]) for j in js),\n", " \"/\".join(osmMorphemes[j][0] for j in js),\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Algorithm\n", "\n", "We have to develop a way of aligning each BHS word with one or more OSM morphemes.\n", "\n", "For each BHS word, we grab OSM morphemes until all consonants in the BHS word have been matched.\n", "If needed, we grab additional BHS words\n", "when the current OSM string happens to be longer than the current BHS word.\n", "\n", "We will encounter cases where this method breaks down: exceptions.\n", "We will collect them for later inspection.\n", "\n", "The exceptions are coded as follows:\n", "\n", "If `w: n` is in the dictionary of exceptions, it means that slot (word) 
`w` in the BHSA is different from its counterpart morpheme(s) in the OSM.\n", "\n", "If `n > 0`, that many OSM morphemes will be gobbled to align with slot `w`.\n", "\n", "If `n < 0`, that many slots from `w` onward will be gobbled to match the current OSM morpheme.\n", "\n", "There are various subtleties involved; see the inline comments in the code below." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "allExceptions = {\n", "    \"2017\": {\n", "        215253: 1,\n", "        266189: 1,\n", "        287360: 2,\n", "        376865: 1,\n", "        383405: 2,\n", "        384049: 1,\n", "        384050: 1,\n", "        405102: -2,\n", "    },\n", "    \"2021\": {\n", "        215256: 1,\n", "        266192: 1,\n", "        287363: 2,\n", "        376869: 1,\n", "        383409: 2,\n", "        384053: 1,\n", "        384054: 1,\n", "        405108: -2,\n", "    },\n", "}\n", "\n", "exceptions = allExceptions[VERSION]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 16s Succeeded in aligning BHS with OSM\n", "| 16s 420109 BHS words matched against 469440 OSM morphemes with 8 known exceptions\n" ] } ], "source": [ "# index in the osmMorphemes list\n", "j = -1\n", "\n", "# mapping from BHSA slot numbers to OSM morpheme indices\n", "osmFromBhs = {}\n", "\n", "u = None\n", "remainingErrors = False\n", "for w in F.otype.s(\"word\"):\n", "    # the previous iteration may have already dealt with this word\n", "    # in that case, we skip to the next word\n", "    # the signal is: w <= u\n", "    if u is not None and w <= u:\n", "        continue\n", "\n", "    # we get the consonantal BHSA word string\n", "    bhs = toCons(F.g_cons_utf8.v(w))\n", "\n", "    # if the BHSA word is empty, we do not link it to any OSM morpheme\n", "    # and continue\n", "    if bhs == \"\":\n", "        continue\n", "\n", "    # we are going to collect OSM morphemes\n", "    # as long as the consonantal reps of the morphemes fit into the BHSA word\n", "    j += 1\n", "    startJ = j\n", "    startW = w\n", "    osm = osmMorphemes[j][1]\n",
"\n", " # but if the word is listed as exception, we collect as many morphemes\n", " # as specified in the exception\n", " maxGobble = exceptions.get(w, None)\n", " gobble = 1\n", " while (len(osm) < len(bhs) and bhs.startswith(osm)) or (\n", " maxGobble is not None and maxGobble > 0\n", " ):\n", " if maxGobble is not None and gobble >= maxGobble:\n", " break\n", " j += 1\n", " osm += osmMorphemes[j][1]\n", " gobble += 1\n", "\n", " # if the OSM morphemes have become longer than the BHSA word,\n", " # we eat up the following BHSA word(s)\n", " # we let u hold the new BHSA word position\n", " u = w\n", " gobble = 1\n", " while (len(osm) > len(bhs) and osm.startswith(bhs)) or (\n", " maxGobble is not None and maxGobble < 0\n", " ):\n", " if maxGobble is not None and gobble >= -maxGobble:\n", " break\n", " u += 1\n", " bhs += toCons(F.g_cons_utf8.v(u))\n", " gobble += 1\n", " gobble = 1\n", "\n", " # if the BHSA words exceed the OSM morphemes found so far, we draw in additional OSM morphemes\n", " # (for the last time)\n", " while len(osm) < len(bhs) and bhs.startswith(osm):\n", " if maxGobble is not None and gobble >= maxGobble:\n", " break\n", " j += 1\n", " osm += osmMorphemes[j][1]\n", " gobble += 1\n", "\n", " # now we have gathered a BHSA string of material, and an OSM string of material\n", " # We test if both strings are equal (modulo final consonant issues)\n", " # If not: alignment breaks down, we stop the loop and show the offending case.\n", " # The programmer should inspect the case and add an exception.\n", " if maxGobble is None and finalCons(bhs) != finalCons(osm):\n", " utils.caption(\n", " 0,\n", " \"\"\"Mismatch in {} at BHS-{} OS-{}->{}:\\nbhs=[{}]\\nos=[{}]\"\"\".format(\n", " \"{} {}:{}\".format(*T.sectionFromNode(w)),\n", " w,\n", " startJ,\n", " j,\n", " bhs,\n", " osm,\n", " ),\n", " )\n", " showCase(w - 5, startJ - 5, j - startJ + 10)\n", " remainingErrors = True\n", " break\n", "\n", " # but if all is well, we link the BHSA words in question 
to the OSM morphemes in question\n", "    # If the BHSA string contains multiple words, we link all those words to all morphemes\n", "    for k in range(startW, u + 1):\n", "        for m in range(startJ, j + 1):\n", "            osmFromBhs.setdefault(k, []).append(m)\n", "\n", "if not remainingErrors:\n", "    utils.caption(0, \"Succeeded in aligning BHS with OSM\")\n", "    utils.caption(\n", "        0,\n", "        \"{} BHS words matched against {} OSM morphemes with {} known exceptions\".format(\n", "            len(osmFromBhs),\n", "            len(osmMorphemes),\n", "            len(exceptions),\n", "        ),\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have constructed, in passing, the mapping `osmFromBhs`,\n", "which maps BHSA words onto corresponding sequences of OSM morphemes.\n", "We also compute its inverse, `bhsFromOsm`." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 18s 469440 morphemes mapped in bhsFromOsm\n" ] } ], "source": [ "# mapping from OSM morphemes (by index in the osmMorphemes list) to BHSA slot numbers\n", "# It is the inverse of osmFromBhs\n", "bhsFromOsm = {}\n", "\n", "for (w, js) in osmFromBhs.items():\n", "    for j in js:\n", "        bhsFromOsm.setdefault(j, []).append(w)\n", "utils.caption(0, \"{} morphemes mapped in bhsFromOsm\".format(len(bhsFromOsm)))" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Inspection of problems\n", "\n", "We have encountered irregularities, but we want to make sure we have seen all potential\n", "alignment problems.\n", "We do this by adding a sanity check: find all cases where\n", "the consonantal material in a BHSA word is not the\n", "concatenation of the consonantal material in its OSM morphemes.\n", "\n", "We now have several irregularities to inspect.\n", "\n", "1. **Multiplicity**\n", "   * By looking into `bhsFromOsm` we can find the OSM morphemes that contain consonantal material\n", "     from multiple BHSA words.\n", "     These are interesting points of difference between the BHSA and OSM encodings, because\n", "     in these cases the OSM produces other word/morpheme boundaries than the BHSA.\n", "   * We also inspect cases where a BHSA word corresponds to more than two OSM morphemes.\n", "1. **Consonantal Sanity**\n", "   By analysis after the fact, we gather all consonantal discrepancies.\n", "1. **Exceptions**\n", "   While developing the algorithm, we needed to invoke a small number of manual exceptions.\n", "\n", "\n", "Now we want to make a comprehensive list of all problematic cases encountered during\n", "alignment.\n", "\n", "We will add the BHSA word numbers involved in a problematic case to the set `problematic`.\n", "When we proceed to compare morphology, we will exclude the problematic cases." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "problematic = set()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiplicity\n", "We gather the cases of multiple BHSA words against a single OSM morpheme."
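] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The bookkeeping in the next cell follows a simple pattern, sketched here on toy data (the mapping below is invented for illustration):\n", "\n", "```python\n", "import collections\n", "\n", "# toy inverse mapping: morpheme index -> BHSA slots\n", "toyBhsFromOsm = {0: [10], 1: [11, 12], 2: [13, 14, 15]}\n", "\n", "countMultiple = collections.Counter()\n", "for (j, ws) in toyBhsFromOsm.items():\n", "    if len(ws) > 1:\n", "        countMultiple[len(ws)] += 1\n", "\n", "# countMultiple == Counter({2: 1, 3: 1})\n", "```"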
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 18s OSM morphemes without corresponding BHSA word: 0\n", "| 18s OSM morphemes corresponding to multiple BHSA words: 122\n", "| 18s OSM morphemes corresponding to 2 BHSA words: 115\n", "| 18s OSM morphemes corresponding to 3 BHSA words: 7\n" ] } ], "source": [ "multipleOSM = {} # OSM morphemes in correspondence with multiple BHS slots\n", "noOSM = {} # OSM morphemes that do not correspond to any BHSA word\n", "\n", "countMultipleOSM = (\n", " collections.Counter()\n", ") # how many times n BHSA words are linked to the same OSM morpheme\n", "\n", "for (j, ws) in bhsFromOsm.items():\n", " nws = len(ws)\n", " if nws > 1:\n", " multipleOSM[j] = nws\n", " countMultipleOSM[nws] += 1\n", " elif nws == 0:\n", " noOSM.add(j)\n", "\n", "utils.caption(\n", " 0,\n", " \"OSM morphemes without corresponding BHSA word: {:>5}\".format(\n", " len(noOSM)\n", " ),\n", ")\n", "utils.caption(\n", " 0,\n", " \"OSM morphemes corresponding to multiple BHSA words: {:>5}\".format(\n", " len(multipleOSM)\n", " ),\n", ")\n", "for (nws, amount) in sorted(countMultipleOSM.items()):\n", " utils.caption(\n", " 0,\n", " \"OSM morphemes corresponding to {} BHSA words: {:>5}\".format(\n", " nws, amount\n", " ),\n", " )" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Genesis 24:65\n", " BHS 12370, 12371 = הַ/לָּזֶה֙\n", " OSM w6 = הַלָּזֶה֙\n", "Genesis 37:19\n", " BHS 20517, 20518 = הַ/לָּזֶ֖ה\n", " OSM w8 = הַלָּזֶ֖ה\n", "Genesis 50:10\n", " BHS 28426, 28427 = הָ/אָטָ֗ד\n", " OSM w4 = הָאָטָ֗ד\n", "Genesis 50:11\n", " BHS 28460, 28461 = הָֽ/אָטָ֔ד\n", " OSM w8 = הָֽאָטָ֔ד\n", "Numbers 13:21\n", " BHS 78125, 78126 = לְ/בֹ֥א\n", " OSM w9 = לְבֹ֥א\n", "Numbers 34:8\n", " BHS 91530, 91531 = לְ/בֹ֣א\n", " OSM w4 = לְבֹ֣א\n", "Deuteronomy 33:2\n", " BHS 112265, 112266 = אשׁ/דת\n", " 
OSM w15 = אשדת\n", "Joshua 13:5\n", " BHS 120469, 120470 = לְ/בֹ֥וא\n", " OSM w13 = לְב֥וֹא\n", "Joshua 18:24\n", " BHS 123359, 123360 = ה/עמני\n", " OSM w2 = העמני\n", "Joshua 18:28\n", " BHS 123394, 123395 = הָ/אֶ֜לֶף\n", " OSM w2 = הָאֶ֜לֶף\n", "Joshua 19:46\n", " BHS 123994, 123995 = הַ/יַּרְקֹ֖ון\n", " OSM w2 = הַיַּרְק֖וֹן\n", "Judges 3:3\n", " BHS 128774, 128775 = לְ/בֹ֥וא\n", " OSM w15 = לְב֥וֹא\n", "Judges 6:5\n", " BHS 130514, 130515 = י/באו\n", " OSM w6 = יבאו\n", "Judges 6:11\n", " BHS 130645, 130646 = הָֽ/עֶזְרִ֑י\n", " OSM w12 = הָֽעֶזְרִ֑י\n", "Judges 6:20\n", " BHS 130861, 130862 = הַ/לָּ֔ז\n", " OSM w13 = הַלָּ֔ז\n", "Judges 6:24\n", " BHS 130966, 130967 = הָ/עֶזְרִֽי\n", " OSM w16 = הָעֶזְרִֽי\n", "Judges 8:32\n", " BHS 132817, 132818 = הָֽ/עֶזְרִֽי\n", " OSM w13 = הָֽעֶזְרִֽי\n", "Judges 15:19\n", " BHS 137184, 137185 = הַ/קֹּורֵא֙\n", " OSM w19 = הַקּוֹרֵא֙\n", "1_Samuel 6:14\n", " BHS 144465, 144466 = הַ/שִּׁמְשִׁי֙\n", " OSM w7 = הַשִּׁמְשִׁי֙\n", "1_Samuel 6:18\n", " BHS 144606, 144607 = הַ/שִּׁמְשִֽׁי\n", " OSM w29 = הַשִּׁמְשִֽׁי\n", "1_Samuel 13:18\n", " BHS 148208, 148209 = הַ/צְּבֹעִ֖ים\n", " OSM w15 = הַצְּבֹעִ֖ים\n", "1_Samuel 14:1\n", " BHS 148335, 148336 = הַ/לָּ֑ז\n", " OSM w18 = הַלָּ֑ז\n", "1_Samuel 16:1\n", " BHS 150297, 150298 = הַ/לַּחְמִ֔י\n", " OSM w24 = הַלַּחְמִ֔י\n", "1_Samuel 16:18\n", " BHS 150666, 150667 = הַ/לַּחְמִי֒\n", " OSM w10 = הַלַּחְמִי֒\n", "1_Samuel 17:2\n", " BHS 150821, 150822 = הָ/אֵלָ֑ה\n", " OSM w7 = הָאֵלָ֑ה\n", "1_Samuel 17:19\n", " BHS 151163, 151164 = הָֽ/אֵלָ֑ה\n", " OSM w7 = הָֽאֵלָ֑ה\n", "1_Samuel 17:26\n", " BHS 151348, 151349 = הַ/לָּ֔ז\n", " OSM w15 = הַלָּ֔ז\n", "1_Samuel 17:58\n", " BHS 152160, 152161 = הַ/לַּחְמִֽי\n", " OSM w14 = הַלַּחְמִֽי\n", "1_Samuel 21:10\n", " BHS 154603, 154604 = הָ/אֵלָ֗ה\n", " OSM w9 = הָאֵלָ֗ה\n", "1_Samuel 23:28\n", " BHS 155965, 155966 = הַֽ/מַּחְלְקֹֽות\n", " OSM w15 = הַֽמַּחְלְקֽוֹת\n", "2_Samuel 3:26\n", " BHS 162277, 162278 = הַ/סִּרָ֑ה\n", " OSM w12 = 
הַסִּרָ֑ה\n", "2_Samuel 12:22\n", " BHS 166917, 166918 = י/חנני\n", " OSM w11 = יחנ\n", "2_Samuel 12:22\n", " BHS 166917, 166918 = י/חנני\n", " OSM w11 = ני\n", "2_Samuel 20:15\n", " BHS 173466, 173467 = הַֽ/מַּעֲכָ֔ה\n", " OSM w6 = הַֽמַּעֲכָ֔ה\n", "2_Samuel 21:16\n", " BHS 174125, 174126 = בְּ/נֹ֜ב\n", " OSM w2 = בְּנֹ֜ב\n", "2_Samuel 21:19\n", " BHS 174223, 174224 = הַ/לַּחְמִ֗י\n", " OSM w13 = הַלַּחְמִ֗י\n", "2_Samuel 23:8\n", " BHS 174932, 174933, 174934 = בַּ//שֶּׁ֜בֶת\n", " OSM w7 = בַּשֶּׁ֜בֶת\n", "2_Samuel 23:33\n", " BHS 175372, 175373 = הָ/ארָרִֽי\n", " OSM w6 = הָארָרִֽי\n", "1_Kings 1:9\n", " BHS 176264, 176265 = הַ/זֹּחֶ֔לֶת\n", " OSM w8 = הַזֹּחֶ֔לֶת\n", "1_Kings 8:65\n", " BHS 183614, 183615 = לְּ/בֹ֥וא\n", " OSM w12 = לְּב֥וֹא\n", "1_Kings 16:34\n", " BHS 189794, 189795 = הָ/אֱלִ֖י\n", " OSM w5 = הָאֱלִ֖י\n", "2_Kings 4:25\n", " BHS 196995, 196996 = הַ/לָּֽז\n", " OSM w21 = הַלָּֽז\n", "2_Kings 7:15\n", " BHS 199376, 199377 = בה/חפזם\n", " OSM w14 = ב\n", "2_Kings 7:15\n", " BHS 199376, 199377 = בה/חפזם\n", " OSM w14 = החפז\n", "2_Kings 7:15\n", " BHS 199376, 199377 = בה/חפזם\n", " OSM w14 = ם\n", "2_Kings 14:25\n", " BHS 204159, 204160 = לְּ/בֹ֥וא\n", " OSM w6 = לְּב֥וֹא\n", "2_Kings 18:27\n", " BHS 207188, 207189 = שׁ/יניהם\n", " OSM w25 = שיני\n", "2_Kings 18:27\n", " BHS 207188, 207189 = שׁ/יניהם\n", " OSM w25 = הם\n", "2_Kings 19:13\n", " BHS 207681, 207682, 207683 = לָ//עִ֣יר\n", " OSM w7 = לָעִ֣יר\n", "2_Kings 23:17\n", " BHS 210347, 210348 = הַ/לָּ֔ז\n", " OSM w4 = הַלָּ֔ז\n", "Isaiah 22:18\n", " BHS 219268, 219269, 219270 = כַּ//דּ֕וּר\n", " OSM w4 = כַּדּ֕וּר\n", "Isaiah 36:12\n", " BHS 224046, 224047 = שׁי/ניהם\n", " OSM w24 = שיני\n", "Isaiah 36:12\n", " BHS 224046, 224047 = שׁי/ניהם\n", " OSM w24 = הם\n", "Isaiah 37:13\n", " BHS 224517, 224518, 224519 = לָ//עִ֣יר\n", " OSM w7 = לָעִ֣יר\n", "Isaiah 49:13\n", " BHS 229391, 229392 = י/פצחו\n", " OSM w5 = יפצחו\n", "Jeremiah 6:21\n", " BHS 238058, 238059 = י/אבדו\n", " OSM w18 = יאבדו\n", 
"Jeremiah 6:29\n", " BHS 238177, 238178 = אשׁ/תם\n", " OSM w3 = אשת\n", "Jeremiah 6:29\n", " BHS 238177, 238178 = אשׁ/תם\n", " OSM w3 = ם\n", "Jeremiah 13:16\n", " BHS 241521, 241522 = י/שׁית\n", " OSM w17 = ישית\n", "Jeremiah 17:13\n", " BHS 243339, 243340 = י/סורי\n", " OSM w7 = יסור\n", "Jeremiah 17:13\n", " BHS 243339, 243340 = י/סורי\n", " OSM w7 = י\n", "Jeremiah 21:9\n", " BHS 245189, 245190 = י/חיה\n", " OSM w14 = יחיה\n", "Jeremiah 29:23\n", " BHS 250010, 250011 = ה/וידע\n", " OSM w18 = הו\n", "Jeremiah 29:23\n", " BHS 250010, 250011 = ה/וידע\n", " OSM w19 = ידע\n", "Jeremiah 38:2\n", " BHS 255634, 255635 = י/חיה\n", " OSM w14 = יחיה\n", "Jeremiah 48:18\n", " BHS 260639, 260640 = י/שׁבי\n", " OSM w3 = ישבי\n", "Ezekiel 36:35\n", " BHS 283125, 283126 = הַ/לֵּ֨זוּ֙\n", " OSM w3 = הַלֵּ֨זוּ֙\n", "Ezekiel 42:14\n", " BHS 287019, 287020 = י/לבשׁו\n", " OSM w18 = ילבשו\n", "Ezekiel 43:6\n", " BHS 287226, 287227 = מִ/דַּבֵּ֥ר\n", " OSM w2 = מִדַּבֵּ֥ר\n", "Ezekiel 45:5\n", " BHS 288552, 288553 = י/היה\n", " OSM w8 = יהיה\n", "Ezekiel 47:15\n", " BHS 290023, 290024 = לְ/בֹ֥וא\n", " OSM w11 = לְב֥וֹא\n", "Ezekiel 47:16\n", " BHS 290038, 290039 = הַ/תִּיכֹ֔ון\n", " OSM w12 = הַתִּיכ֔וֹן\n", "Ezekiel 47:20\n", " BHS 290129, 290130 = לְ/בֹ֣וא\n", " OSM w8 = לְב֣וֹא\n", "Ezekiel 48:1\n", " BHS 290210, 290211 = לְֽ/בֹוא\n", " OSM w10 = לְֽבוֹא\n", "Amos 6:14\n", " BHS 297188, 297189 = לְּ/בֹ֥וא\n", " OSM w14 = לְּב֥וֹא\n", "Micah 1:10\n", " BHS 299725, 299726 = לְ/עַפְרָ֔ה\n", " OSM w8 = לְעַפְרָ֔ה\n", "Nahum 3:3\n", " BHS 301919, 301920 = י/כשׁלו\n", " OSM w14 = יכשלו\n", "Zechariah 2:8\n", " BHS 305502, 305503 = הַ/לָּ֖ז\n", " OSM w7 = הַלָּ֖ז\n", "Zechariah 9:8\n", " BHS 307601, 307602 = מִ/צָּבָה֙\n", " OSM w3 = מִצָּבָה֙\n", "Zechariah 14:6\n", " BHS 309064, 309065 = י/קפאון\n", " OSM w8 = יקפאו\n", "Zechariah 14:6\n", " BHS 309064, 309065 = י/קפאון\n", " OSM w8 = ן\n", "Psalms 9:1\n", " BHS 311564, 311565, 311566 = לַ//בֵּ֗ן\n", " OSM w3 = לַבֵּ֗ן\n", "Psalms 
9:10\n", " BHS 311656, 311657, 311658 = בַּ//צָּרָֽה\n", " OSM w7 = בַּצָּרָֽה\n", "Psalms 10:1\n", " BHS 311776, 311777, 311778 = בַּ//צָּרָֽה\n", " OSM w7 = בַּצָּרָֽה\n", "Psalms 10:10\n", " BHS 311885, 311886 = חל/כאים\n", " OSM w5 = חלכאים\n", "Psalms 41:3\n", " BHS 317247, 317248 = י/אשׁר\n", " OSM w4 = יאשר\n", "Psalms 55:16\n", " BHS 319513, 319514 = ישׁי/מות\n", " OSM w1 = ישימות\n", "Psalms 123:4\n", " BHS 332922, 332923 = גא/יונים\n", " OSM w8 = גְאֵ֥יוֹנִֽים\n", "Job 6:14\n", " BHS 337706, 337707 = מֵ/רֵעֵ֣הוּ\n", " OSM w2 = מֵרֵעֵ֣\n", "Job 6:14\n", " BHS 337706, 337707 = מֵ/רֵעֵ֣הוּ\n", " OSM w2 = הוּ\n", "Job 9:30\n", " BHS 338535, 338536 = ב/מו\n", " OSM w3 = במו\n", "Job 10:20\n", " BHS 338784, 338785 = י/חדל\n", " OSM w4 = יחדל\n", "Job 10:20\n", " BHS 338786, 338787 = י/שׁית\n", " OSM w5 = ישית\n", "Job 38:12\n", " BHS 345551, 345552 = ידעת/ה\n", " OSM w4 = ידעתה\n", "Proverbs 18:17\n", " BHS 351776, 351777 = י/בא\n", " OSM w4 = יבא\n", "Proverbs 19:4\n", " BHS 351882, 351883 = מֵ/רֵ֥עהוּ\n", " OSM w6 = מֵרֵ֥ע\n", "Proverbs 19:4\n", " BHS 351882, 351883 = מֵ/רֵ֥עהוּ\n", " OSM w6 = הוּ\n", "Proverbs 20:4\n", " BHS 352164, 352165 = י/שׁאל\n", " OSM w5 = ישאל\n", "Ecclesiastes 6:10\n", " BHS 361436, 361437 = שׁה/תקיף\n", " OSM w14 = ש\n", "Ecclesiastes 6:10\n", " BHS 361436, 361437 = שׁה/תקיף\n", " OSM w14 = התקיף\n", "Daniel 8:16\n", " BHS 375554, 375555 = הַ/לָּ֖ז\n", " OSM w10 = הַלָּ֖ז\n", "Daniel 11:12\n", " BHS 377178, 377179 = י/רום\n", " OSM w3 = ירום\n", "Ezra 2:61\n", " BHS 378928, 378929 = הַ/קֹּ֑וץ\n", " OSM w6 = הַקּ֑וֹץ\n", "Ezra 10:29\n", " BHS 383320, 383321 = י/רמות\n", " OSM w8 = ירמות\n", "Nehemiah 3:4\n", " BHS 384335, 384336 = הַ/קֹּ֔וץ\n", " OSM w8 = הַקּ֔וֹץ\n", "Nehemiah 3:6\n", " BHS 384370, 384371 = הַ/יְשָׁנָ֜ה\n", " OSM w3 = הַיְשָׁנָ֜ה\n", "Nehemiah 3:12\n", " BHS 384483, 384484 = הַ/לֹּוחֵ֔שׁ\n", " OSM w6 = הַלּוֹחֵ֔שׁ\n", "Nehemiah 3:21\n", " BHS 384676, 384677 = הַ/קֹּ֖וץ\n", " OSM w7 = הַקּ֖וֹץ\n", "Nehemiah 7:63\n", 
" BHS 386961, 386962 = הַ/קֹּ֑וץ\n", " OSM w6 = הַקּ֑וֹץ\n", "Nehemiah 10:25\n", " BHS 388830, 388831 = הַ/לֹּוחֵ֥שׁ\n", " OSM w1 = הַלּוֹחֵ֥שׁ\n", "Nehemiah 11:35\n", " BHS 389765, 389766 = הַ/חֲרָשִֽׁים\n", " OSM w4 = הַחֲרָשִֽׁים\n", "Nehemiah 12:39\n", " BHS 390290, 390291 = הַ/יְשָׁנָ֜ה\n", " OSM w6 = הַיְשָׁנָ֜ה\n", "1_Chronicles 4:7\n", " BHS 392906, 392907 = י/צחר\n", " OSM w4 = יצחר\n", "1_Chronicles 7:34\n", " BHS 395488, 395489 = י/חבה\n", " OSM w5 = יחבה\n", "1_Chronicles 13:5\n", " BHS 398556, 398557 = לְ/בֹ֣וא\n", " OSM w10 = לְב֣וֹא\n", "1_Chronicles 18:12\n", " BHS 401053, 401054 = הַ/מֶּ֔לַח\n", " OSM w8 = הַמֶּ֔לַח\n", "1_Chronicles 24:10\n", " BHS 403641, 403642 = הַ/קֹּוץ֙\n", " OSM w1 = הַקּוֹץ֙\n", "1_Chronicles 27:12\n", " BHS 405108, 405109 = בן/ימיני\n", " OSM w6 = בנימיני\n", "2_Chronicles 7:8\n", " BHS 410242, 410243 = לְּ/בֹ֥וא\n", " OSM w15 = לְּב֥וֹא\n", "2_Chronicles 25:11\n", " BHS 419159, 419160 = הַ/מֶּ֑לַח\n", " OSM w8 = הַמֶּ֑לַח\n", "2_Chronicles 36:21\n", " BHS 426512, 426513 = הָ/שַּׁמָּה֙\n", " OSM w13 = הָשַּׁמָּ\n", "2_Chronicles 36:21\n", " BHS 426512, 426513 = הָ/שַּׁמָּה֙\n", " OSM w13 = ה֙\n" ] } ], "source": [ "for j in multipleOSM:\n", " ws = bhsFromOsm[j]\n", " problematic |= set(ws)\n", " if not SCRIPT:\n", " BHSvsOSM(ws, [j])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Consonantal sanity\n", "Which non-empty BHSA words are not the concatenation of their OSM morphemes?\n", "\n", "We do not consider the cases where more than one BHSA word corresponds to an OSM morpheme,\n", "because we have already gathered those cases above." 
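] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The check below compares consonantal transcriptions after passing them through `unFinal`, a helper defined earlier in this notebook. The point of such a normalizer is that a morpheme boundary can turn a word-final consonant into a medial one, so final letter forms must be neutralized before comparing. A self-contained sketch of the idea (an illustration, not the actual implementation):\n", "\n", "```python\n", "# map the five Hebrew final consonant forms to their medial counterparts\n", "FINALS = {'ך': 'כ', 'ם': 'מ', 'ן': 'נ', 'ף': 'פ', 'ץ': 'צ'}\n", "\n", "def unFinal(text):\n", "    # replace final-form consonants by their medial forms\n", "    return ''.join(FINALS.get(c, c) for c in text)\n", "\n", "print(unFinal('מים'))  # מימ\n", "```" 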
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 20s insane BHS words: 6\n", "| 20s insane OSM morphemes: 8\n" ] } ], "source": [ "insaneBHS = set() # alignment problems by BHSA slot number\n", "insaneOSM = set() # alignment problems by OSM morpheme index in osmMorphemes\n", "\n", "# We compute the slot numbers that are part of a multiple-slot alignment to a morpheme\n", "multipleBHS = reduce(set.union, (bhsFromOsm[j] for j in multipleOSM), set())\n", "\n", "# Gather the insanities\n", "for (w, js) in osmFromBhs.items():\n", " if w in multipleBHS:\n", " continue\n", " cw = toCons(F.g_cons_utf8.v(w))\n", " cjs = \"\".join(osmMorphemes[j][1] for j in js)\n", " if unFinal(cw) != unFinal(cjs):\n", " insaneBHS.add(w)\n", " insaneOSM |= set(js)\n", "utils.caption(0, \"insane BHS words: {:>4}\".format(len(insaneBHS)))\n", "utils.caption(0, \"insane OSM morphemes: {:>4}\".format(len(insaneOSM)))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ezekiel 4:6\n", " BHS 266192 = ימוני\n", " OSM w7 = ימיני\n", "Ezekiel 43:11\n", " BHS 287363 = צורתו\n", " OSM w17, w17 = צורת/י\n", "Daniel 10:19\n", " BHS 376869 = כְ\n", " OSM w10 = בְ\n", "Ezra 10:44\n", " BHS 383409 = נשׂאו\n", " OSM w3, w3 = נשא/י\n", "Nehemiah 2:13\n", " BHS 384053 = הם\n", " OSM w17 = ה\n", "Nehemiah 2:13\n", " BHS 384054 = פרוצים\n", " OSM w17 = מפרוצים\n" ] } ], "source": [ "for w in sorted(insaneBHS):\n", " problematic.add(w)\n", " js = osmFromBhs[w]\n", " if not SCRIPT:\n", " BHSvsOSM([w], js)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More than two morphemes per word\n", "\n", "Let's study the mapping of BHSA words to OSM morphemes in a bit more detail.\n", "We are interested in the question: to how many morphemes can words map?\n", "\n", "Later we shall see that we can deal with 1 and 2 morphemes per 
word.\n", "\n", "We deem words that map to more than two morphemes problematic.\n", "\n", "This turns out to be a very small minority." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 20s 1 morphemes per word: 370680\n", "| 20s 2 morphemes per word: 49400\n", "| 20s 3 morphemes per word: 27\n", "| 20s 4 morphemes per word: 2\n" ] } ], "source": [ "morphemesPerWord = collections.Counter()\n", "tooMany = set()\n", "for (w, js) in osmFromBhs.items():\n", " n = len(js)\n", " morphemesPerWord[n] += 1\n", " if n > 2:\n", " tooMany.add(w)\n", "\n", "for (ln, amount) in sorted(morphemesPerWord.items()):\n", " utils.caption(0, \"{:>2} morphemes per word: {:>6}\".format(ln, amount))" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Numbers 33:46\n", " BHS 91209 = עַלְמֹ֥ן דִּבְלָתָֽיְמָה\n", " OSM w5, w6, w6 = עַלְמֹ֥ן/דִּבְלָתָֽיְמָ/ה\n", "Numbers 33:47\n", " BHS 91213 = עַלְמֹ֣ן דִּבְלָתָ֑יְמָה\n", " OSM w2, w3, w3 = עַלְמֹ֣ן/דִּבְלָתָ֑יְמָ/ה\n", "Deuteronomy 10:6\n", " BHS 99249 = בְּאֵרֹ֥ת בְּנֵי־יַעֲקָ֖ן\n", " OSM w4, w5, w6 = בְּאֵרֹ֥ת/בְּנֵי/יַעֲקָ֖ן\n", "Joshua 13:17\n", " BHS 120713 = בֵ֖ית בַּ֥עַל מְעֹֽון\n", " OSM w9, w10, w11 = בֵ֖ית/בַּ֥עַל/מְעֽוֹן\n", "Joshua 15:32\n", " BHS 121905 = עַ֣יִן וְרִמֹּ֑ון\n", " OSM w3, w4, w4 = עַ֣יִן/וְ/רִמּ֑וֹן\n", "Joshua 15:62\n", " BHS 122150 = עִיר־הַמֶּ֖לַח\n", " OSM w2, w3, w3 = עִיר/הַ/מֶּ֖לַח\n", "Judges 8:13\n", " BHS 132427 = מַעֲלֵ֖ה הֶחָֽרֶס\n", " OSM w7, w8, w8 = מַעֲלֵ֖ה/הֶ/חָֽרֶס\n", "1_Samuel 4:1\n", " BHS 143295 = הָאֶ֣בֶן הָעֵ֔זֶר\n", " OSM w13, w13, w14 = הָ/אֶ֣בֶן/הָעֵ֔זֶר\n", "1_Samuel 25:3\n", " BHS 156565 = כלבו\n", " OSM w17, w17, w17 = כ/לב/ו\n", "1_Kings 7:36\n", " BHS 181670 = ומסגרתיה\n", " OSM w6, w6, w6 = ו/מסגרתי/ה\n", "1_Kings 15:20\n", " BHS 188754 = אָבֵ֣ל בֵּֽית־מַעֲכָ֑ה\n", " OSM w22, w23, w24 = 
אָבֵ֣ל/בֵּֽית/מַעֲכָ֑ה\n", "2_Kings 7:15\n", " BHS 199376 = בה\n", " OSM w14, w14, w14 = ב/החפז/ם\n", "2_Kings 7:15\n", " BHS 199377 = חפזם\n", " OSM w14, w14, w14 = ב/החפז/ם\n", "2_Kings 10:12\n", " BHS 201428 = בֵּֽית־עֵ֥קֶד הָרֹעִ֖ים\n", " OSM w6, w7, w8 = בֵּֽית/עֵ֥קֶד/הָרֹעִ֖ים\n", "2_Kings 15:29\n", " BHS 204852 = אָבֵ֣ל בֵּֽית־מַעֲכָ֡ה\n", " OSM w14, w15, w16 = אָבֵ֣ל/בֵּֽית/מַעֲכָ֡ה\n", "Isaiah 8:1\n", " BHS 214752 = מַהֵ֥ר שָׁלָ֖ל חָ֥שׁ בַּֽז\n", " OSM w12, w13, w14, w15 = מַהֵ֥ר/שָׁלָ֖ל/חָ֥שׁ/בַּֽז\n", "Isaiah 8:3\n", " BHS 214783 = מַהֵ֥ר שָׁלָ֖ל חָ֥שׁ בַּֽז\n", " OSM w12, w13, w14, w15 = מַהֵ֥ר/שָׁלָ֖ל/חָ֥שׁ/בַּֽז\n", "Jeremiah 39:3\n", " BHS 256418 = נֵרְגַ֣ל שַׂר־֠אֶצֶר\n", " OSM w9, w10, w11 = נֵרְגַ֣ל/שַׂר/אֶ֠צֶר\n", "Jeremiah 39:3\n", " BHS 256423 = נֵרְגַ֤ל שַׂר־אֶ֨צֶר֙\n", " OSM w18, w19, w20 = נֵרְגַ֤ל/שַׂר/אֶ֨צֶר֙\n", "Jeremiah 39:13\n", " BHS 256650 = נֵרְגַ֥ל שַׂר־אֶ֖צֶר\n", " OSM w8, w9, w10 = נֵרְגַ֥ל/שַׂר/אֶ֖צֶר\n", "Ezekiel 44:24\n", " BHS 288287 = ושׁפטהו\n", " OSM w7, w7, w7 = ו/שפט/הו\n", "Psalms 91:12\n", " BHS 326518 = יִשָּׂא֑וּנְךָ\n", " OSM w3, w3, w3 = יִשָּׂא֑וּ/נְ/ךָ\n", "Job 30:4\n", " BHS 343152 = לַחְמָֽם\n", " OSM w7, w7, w7 = לַ/חְמָֽ/ם\n", "Proverbs 11:3\n", " BHS 349658 = ושׁדם\n", " OSM w6, w6, w6 = ו/שד/ם\n", "Proverbs 12:26\n", " BHS 350148 = מֵרֵעֵ֣הוּ\n", " OSM w2, w2, w2 = מֵ/רֵעֵ֣/הוּ\n", "Daniel 11:26\n", " BHS 377478 = פַת־בָּגֹ֛ו\n", " OSM w2, w3, w3 = פַת/בָּג֛/וֹ\n", "1_Chronicles 2:54\n", " BHS 392520 = עַטְרֹ֖ות בֵּ֣ית יֹואָ֑ב\n", " OSM w6, w7, w8 = עַטְר֖וֹת/בֵּ֣ית/יוֹאָ֑ב\n", "2_Chronicles 32:21\n", " BHS 423602 = מיציאו\n", " OSM w20, w20, w20 = מ/יציא/ו\n", "2_Chronicles 34:6\n", " BHS 424586 = הר בתיהם\n", " OSM w7, w8, w8 = הר/בתי/הם\n" ] } ], "source": [ "for w in sorted(tooMany):\n", " js = osmFromBhs[w]\n", " if not SCRIPT:\n", " BHSvsOSM([w], js)\n", " problematic.add(w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exceptions\n", "\n", "Finally, we inspect the cases that 
correspond to the manual exceptions." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Isaiah 9:6\n", " BHS 215256 = מרבה\n", " OSM w1 = םרבה\n", "Ezekiel 4:6\n", " BHS 266192 = ימוני\n", " OSM w7 = ימיני\n", "Ezekiel 43:11\n", " BHS 287363 = צורתו\n", " OSM w17, w17 = צורת/י\n", "Daniel 10:19\n", " BHS 376869 = כְ\n", " OSM w10 = בְ\n", "Ezra 10:44\n", " BHS 383409 = נשׂאו\n", " OSM w3, w3 = נשא/י\n", "Nehemiah 2:13\n", " BHS 384053 = הם\n", " OSM w17 = ה\n", "Nehemiah 2:13\n", " BHS 384054 = פרוצים\n", " OSM w17 = מפרוצים\n", "1_Chronicles 27:12\n", " BHS 405108, 405109 = בן/ימיני\n", " OSM w6 = בנימיני\n" ] } ], "source": [ "for (w, n) in exceptions.items():\n", " if n > 0:\n", " js = osmFromBhs[w]\n", " if not SCRIPT:\n", " BHSvsOSM([w], js)\n", " problematic.add(w)\n", " else:\n", " j = osmFromBhs[w][0]\n", " ws = bhsFromOsm[j]\n", " if not SCRIPT:\n", " BHSvsOSM(ws, [j])\n", " problematic |= set(ws)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the number of problematic words in the BHSA that we will exclude from comparisons." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 20s There are 259 problematic words in the BHSA wrt OSM\n", "| 20s These will be excluded from further comparisons\n" ] } ], "source": [ "utils.caption(\n", " 0, f\"There are {len(problematic)} problematic words in the BHSA wrt OSM\"\n", ")\n", "utils.caption(0, \"These will be excluded from further comparisons\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Missing morphology\n", "\n", "We make a list of word nodes for which no morpheme has been tagged with morphology.\n", "We consider only non-empty words." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "noMorphWords = set()\n", "for w in F.otype.s(\"word\"):\n", " if not F.g_word_utf8.v(w):\n", " continue\n", " hasMorph = False\n", " for j in osmFromBhs.get(w, []):\n", " if osmMorphemes[j][2]:\n", " hasMorph = True\n", " break\n", " if not hasMorph:\n", " noMorphWords.add(w)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 20s There is OSM morphology for all non-empty BHSA words\n" ] } ], "source": [ "if len(noMorphWords):\n", " utils.caption(0, f\"No OSM morphology for {len(noMorphWords)} non-empty BHSA words\")\n", "else:\n", " utils.caption(0, \"There is OSM morphology for all non-empty BHSA words\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get a feeling for how the non-tagged morphemes are distributed.\n", "First we represent them as a list of intervals, using a utility function of TF,\n", "and then we get an overview of the lengths of the intervals." 
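] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The utility in question is `rangesFromSet` from `tf.core.helpers` (imported at the top). Judging from how we use it below, it collapses a set of integers into a sorted list of inclusive `(start, end)` intervals. A self-contained sketch of that behaviour (an illustration under this assumption, not the actual TF source):\n", "\n", "```python\n", "def rangesFromSet(nodeSet):\n", "    # collapse a set of ints into sorted, inclusive (start, end) ranges\n", "    ranges = []\n", "    for n in sorted(nodeSet):\n", "        if ranges and n == ranges[-1][1] + 1:\n", "            ranges[-1] = (ranges[-1][0], n)  # extend the current range\n", "        else:\n", "            ranges.append((n, n))  # start a new range\n", "    return ranges\n", "\n", "print(rangesFromSet({1, 2, 3, 7, 8, 12}))  # [(1, 3), (7, 8), (12, 12)]\n", "```" 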
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 20s no non-marked-up stretches\n" ] } ], "source": [ "noMorphIntervals = rangesFromSet(noMorphWords)\n", "\n", "noMorphLengths = collections.Counter()\n", "\n", "for interval in noMorphIntervals:\n", " noMorphLengths[interval[1] - interval[0] + 1] += 1\n", "\n", "if noMorphLengths:\n", " utils.caption(0, \"Non-marked-up stretches having length x: y times\")\n", " for (ln, amount) in sorted(noMorphLengths.items()):\n", " utils.caption(0, \"{:>4}: {:>5}\".format(ln, amount))\n", "else:\n", " utils.caption(0, \"no non-marked-up stretches\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Data generation\n", "\n", "We now proceed to compile the OSM morphology into Text-Fabric features.\n", "\n", "The basic idea is: create a feature `osm` and for each BHSA word, let it contain the contents of the\n", "corresponding `morph` attribute in the OSM source.\n", "\n", "There are several things to deal with, or not to deal with.\n", "\n", "## Problematic cases\n", "We will ignore the problematic cases. More precisely, whenever a BHSA word belongs to a problematic case,\n", "as diagnosed before, we fill its `osm` feature with the value `*`.\n", "\n", "## Multiplicity of morphemes\n", "There are BHSA words that do not correspond to OSM morphemes. The empty words. We will not give them an `osm` value.\n", "\n", "There are BHSA words that correspond to more than two morphemes. We have added them to our problematic list.\n", "\n", "The vast majority of BHSA words correspond to a single OSM morpheme.\n", "The `osm` feature of those words will be filled\n", "with the `morph` attribute part of the corresponding OSM morpheme. 
No problem here.\n", "\n", "The remaining cases consist of BHSA words that correspond to exactly two morphemes.\n", "We will use the value of the `morph` of the first morpheme to fill the `osm` feature for such words,\n", "and we will make a new feature, `osm_sf` and fill it with the `morph` of the second morpheme.\n", "\n", "So, we will create a TF module consisting of two features: `osm` and `osm_sf` (`osm` suffix).\n", "\n", "Let's assemble the feature data." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "osmData = {}\n", "osm_sfData = {}\n", "for (w, js) in osmFromBhs.items():\n", " if w in problematic:\n", " osmData[w] = \"*\"\n", " continue\n", " osmData[w] = osmMorphemes[js[0]][2]\n", " if len(js) > 1:\n", " osm_sfData[w] = osmMorphemes[js[1]][2]" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "genericMetaPath = f\"{thisRepo}/yaml/generic.yaml\"\n", "bridgingMetaPath = f\"{thisRepo}/yaml/bridging.yaml\"\n", "\n", "with open(genericMetaPath) as fh:\n", " genericMeta = yaml.load(fh, Loader=yaml.FullLoader)\n", " genericMeta[\"version\"] = VERSION\n", "with open(bridgingMetaPath) as fh:\n", " bridgingMeta = formatMeta(yaml.load(fh, Loader=yaml.FullLoader))\n", "\n", "metaData = {\"\": genericMeta, **bridgingMeta}" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "nodeFeatures = dict(osm=osmData, osm_sf=osm_sfData)\n", "\n", "for f in nodeFeatures:\n", " metaData[f][\"valueType\"] = \"str\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And combine it with a bit of metadata." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 
39s Writing osm features to TF .\n", "..............................................................................................\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "utils.caption(4, \"Writing osm features to TF\")\n", "TFw = Fabric(locations=thisTempTf, silent=True)\n", "TFw.save(nodeFeatures=nodeFeatures, edgeFeatures={}, metaData=metaData)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Diffs\n", "\n", "Check differences with previous versions." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 5m 50s Check differences with previous version .\n", "..............................................................................................\n", "| 5m 50s \tno features to add\n", "| 5m 50s \tno features to delete\n", "| 5m 50s \t2 features in common\n", "| 5m 50s osm ... no changes\n", "| 5m 51s osm_sf ... no changes\n", "| 5m 51s Done\n" ] } ], "source": [ "utils.checkDiffs(thisTempTf, thisTf, only=set(nodeFeatures))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Deliver\n", "\n", "Copy the new TF features from the temporary location where they have been created to their final destination." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 
5m 56s Deliver data set to /Users/dirk/github/etcbc/bridging/tf/2021 .\n", "..............................................................................................\n" ] } ], "source": [ "utils.deliverDataset(thisTempTf, thisTf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Compile TF" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 6m 06s Load and compile the new TF features .\n", "..............................................................................................\n" ] } ], "source": [ "utils.caption(4, \"Load and compile the new TF features\")" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 9.0.4\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "117 features found and 0 ignored\n", " 0.00s loading features ...\n", " | 0.00s Dataset without structure sections in otext:no structure functions in the T-API\n", " | 0.85s T osm from ~/github/etcbc/bridging/tf/2021\n", " | 0.15s T osm_sf from ~/github/etcbc/bridging/tf/2021\n", " 4.50s All features loaded/computed - for details use TF.isLoaded()\n" ] }, { "data": { "text/plain": [ "[('Computed',\n", " 'computed-data',\n", " ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),\n", " ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),\n", " ('Fabric', 'loading', ('TF',)),\n", " ('Locality', 'locality', ('L Locality',)),\n", " ('Nodes', 'navigating-nodes', ('N Nodes',)),\n", " ('Features',\n", " 'node-features',\n", " ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),\n", " ('Search', 'search', ('S Search',)),\n", " ('Text', 'text', ('T Text',))]" ] }, "execution_count": 40, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "TF = Fabric(locations=[coreTf, thisTf], modules=[\"\"])\n", "api = TF.load(\"language \" + \" \".join(nodeFeatures))\n", "api.makeAvailableIn(globals())" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 6m 53s Basic tests .\n", "..............................................................................................\n", "..............................................................................................\n", ". 6m 53s Language according to BHSA and OSM .\n", "..............................................................................................\n" ] } ], "source": [ "utils.caption(4, \"Basic tests\")\n", "utils.caption(4, \"Language according to BHSA and OSM\")" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 16m 49s No other languages encountered than A, H\n", "| 16m 49s Language discrepancies: 2\n", "| 16m 49s Psalms 116:12 word 330987: כָּֽל - BHSA: Aramaic; OSM: Hebrew\n", "| 16m 49s Psalms 116:12 word 330988: תַּגְמוּלֹ֥והִי - BHSA: Aramaic; OSM: Hebrew\n" ] } ], "source": [ "langBhsFromOsm = dict(A=\"Aramaic\", H=\"Hebrew\")\n", "langOsmFromBhs = dict((y, x) for (x, y) in langBhsFromOsm.items())\n", "\n", "xLanguage = set()\n", "strangeLanguage = collections.Counter()\n", "\n", "for w in F.otype.s(\"word\"):\n", " osm = F.osm.v(w)\n", " if osm is None or osm == \"\" or osm == \"*\":\n", " continue\n", " osmLanguage = osm[0]\n", " trans = langBhsFromOsm.get(osmLanguage, None)\n", " if trans is None:\n", " strangeLanguage[osmLanguage] += 1\n", " else:\n", " if langBhsFromOsm[osm[0]] != F.language.v(w):\n", " xLanguage.add(w)\n", "\n", "if strangeLanguage:\n", " utils.caption(0, \"Strange languages\")\n", " for (ln, 
amount) in sorted(strangeLanguage.items()):\n", " utils.caption(0, \"Strange language {}: {:>5}x\".format(ln, amount))\n", "else:\n", " utils.caption(\n", " 0, \"No other languages encountered than {}\".format(\", \".join(langBhsFromOsm))\n", " )\n", "utils.caption(0, \"Language discrepancies: {}\".format(len(xLanguage)))\n", "for w in sorted(xLanguage):\n", " passage = \"{} {}:{}\".format(*T.sectionFromNode(w))\n", " utils.caption(0, f\"{passage} word {w}: {F.g_word_utf8.v(w):>12} - BHSA: {F.language.v(w)}; OSM: {langBhsFromOsm[F.osm.v(w)[0]]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# End of pipeline\n", "\n", "If this notebook is run for the purpose of generating data, this is where it ends.\n", "\n", "After this point, tests and examples are run." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "if SCRIPT:\n", " stop(good=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can write notebooks to process BHSA data and grab the OSM morphology as you go, like so:\n", "\n", "```\n", "A = use(\"ETCBC/bhsa\", mod=\"etcbc/bridging/tf\", hoist=globals())\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tests and examples\n", "\n", "Before we flesh out the alignment algorithm,\n", "let's find the first point where BHSA and OSM diverge." 
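] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The scans below reduce BHSA words to their consonants with `toCons`, a helper defined earlier in this notebook. A minimal sketch of such a reduction (our reading of the idea, not the exact code), built on `normalize` and `category` from `unicodedata`, which are imported at the top: decompose the text, then drop all combining marks, so that only the consonants remain:\n", "\n", "```python\n", "from unicodedata import category, normalize\n", "\n", "def toCons(text):\n", "    # NFD splits off the pointing as combining marks (category 'M...'),\n", "    # which we then filter out, leaving the bare consonants\n", "    return ''.join(\n", "        c for c in normalize('NFD', text) if not category(c).startswith('M')\n", "    )\n", "\n", "print(toCons('בְּרֵאשִׁית'))  # בראשית\n", "```" 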
] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 17m 02s Mismatch at BHS-62 OSM-61: bhs=[] osm=[אור]\n" ] } ], "source": [ "for (i, w) in enumerate(F.otype.s(\"word\")):\n", " bhs = toCons(F.g_cons_utf8.v(w))\n", " osm = osmMorphemes[i][1]\n", " if bhs != osm:\n", " utils.caption(\n", " 0, \"Mismatch at BHS-{} OSM-{}: bhs=[{}] osm=[{}]\".format(w, i, bhs, osm)\n", " )\n", " break" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Genesis', 1, 5)\n", "BHS\n", "word 62 = []\n", "word 63 = [אור]\n", "word 64 = [יום]\n", "word 65 = [ו]\n", "word 66 = [ל]\n", "OSM\n", "morph 61 = [אור]\n", "morph 62 = [יום]\n", "morph 63 = [ו]\n", "morph 64 = [ל]\n", "morph 65 = [חשך]\n" ] } ], "source": [ "showCase(62, 61, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a case of an empty article in the BHSA.\n", "Let's circumvent this, and move on." 
] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 17m 52s Mismatch at BHS-194 OSM-187:\n", "bhs=[מינו]\n", "osm=[מינ]\n" ] } ], "source": [ "j = -1\n", "for w in F.otype.s(\"word\"):\n", " bhs = toCons(F.g_cons_utf8.v(w))\n", " if bhs == \"\":\n", " continue\n", " j += 1\n", " osm = osmMorphemes[j][1]\n", " if bhs != osm:\n", " utils.caption(\n", " 0,\n", " \"\"\"Mismatch at BHS-{} OSM-{}:\\nbhs=[{}]\\nosm=[{}]\"\"\".format(w, j, bhs, osm),\n", " )\n", " break" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Genesis', 1, 11)\n", "BHS\n", "word 194 = [מינו]\n", "word 195 = [אשר]\n", "word 196 = [זרעו]\n", "word 197 = [בו]\n", "word 198 = [על]\n", "OSM\n", "morph 187 = [מינ]\n", "morph 188 = [ו]\n", "morph 189 = [אשר]\n", "morph 190 = [זרע]\n", "morph 191 = [ו]\n" ] } ], "source": [ "showCase(194, 187, 5)" ] } ], "metadata": { "jupytext": { "encoding": "# -*- coding: utf-8 -*-" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }