{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Heads 😶\n",
    "## By Cody Kingham, in collaboration with Christiaan Erwich\n",
    "\n",
    "## Problem Description\n",
    "The ETCBC's BHSA core data does not contain the standard syntax tree format. This also means that syntactic and functional relationships between individual words are not mapped in a transparent or easily accessible way. In some cases, fine-grained relationships are ignored altogether. For example, for a given noun phrase (NP), there is no explicit way of obtaining its head noun (i.e. the noun itself without any modifying elements). This causes numerous problems for research in the realm of semantics. For instance, it is currently very difficult to calculate the complete person, gender, and number (PGN) of a given subject phrase. That is because PGN is stored at the word level only. But this is a very inadequate representation. Phrases in the ETCBC often contain coordinate relationships within the phrase. So even if one selects the first \"noun\" in the phrase and checks for its PGN value, they may overlook the presence of another noun which makes the phrase plural. Ideally, the phrase itself would have a PGN feature. But before this kind of data is created, it is necessary to separate the head words of a phrase from their modifying elements such as adjectives, determiners, or nouns in construct (genitive) relations.\n",
    "\n",
    "A head word can be defined as the word for which a phrase type is named after. Examples of phrase types are \"NP\" (noun phrase) or \"VP\" (verb phrase). In this notebook, we experiment with and build the functions stored in `heads.py` in order to export a set of Text-Fabric edge features. The edge features represent a mapping from a phrase node to its head element. \n",
    "\n",
    "This goal requires us to think carefully about the way inter-word, semantic relations are reflected in the ETCBC's data. The ETCBC *does* contain some rudimentary semantic embeddings through the so-called [subphrase](https://etcbc.github.io/bhsa/features/hebrew/c/otype). These can be utilized to isolate head words from secondary elements. A subphrase should *not* be thought of as a smaller, embedded phrase, like the ETCBC's phrase-atom (though it sometimes must indadequately fill that role). Rather, the subphrase is a way to encode relationships between words below the level of a phrase(atom), hence \"sub.\" A subphrase can be a single word, or it can be a collection of words. A word can be in multiple subphrases, but can not be in more than 3 (due to the limitations of the data creation program, [parsephrases](http://www.etcbc.nl/datacreation/#ps3.p)).\n",
    "\n",
    "## Method\n",
    "The types of phrases represented in the ETCBC include `NP` (noun phrase), `VP` (verb phrase), `PrNP` (proper noun phrase), `PP` (prepositional phrase), `AdvP` (adverbial phrase), and [eight others](https://etcbc.github.io/bhsa/features/hebrew/c/typ). For some of these types, isolating the head word is a simple affair. By coordinating a word's phrase-dependent part of speech with its enclosing phrase's type, one can identify the head word. For a `VP`, that would mean simply finding the word within the phrase that has a `pdp` (phrase dependent part of speech) value of `verb`. Or for a prepositional phrase, find the word with a `pdp` of `prep`.\n",
    "\n",
    "The `NP` and `PrNP`, on the other hand, present special challenges. These phrases often contain multiple words with a modifying relation to the head noun. An example of this is the construct relation (e.g. \"Son of Jacob\"). The problem becomes particularly thorny when relations like the construct are chained together so that one is faced with the choice between multiple potential head nouns.\n",
    "\n",
    "To navigate the problem, we must use the feature [rela (relationship)](https://etcbc.github.io/bhsa/features/hebrew/c/rela) stored on `subphrase`s in addition to the `pdp` and phrase `type` features. In order to isolate the head word of a `NP`, we look for a word within the phrase that has a `pdp` value of `subs` (i.e. noun). We then obtain a list of all the `subphrase`s which contain that word using the [L.u Text-Fabric method](https://github.com/Dans-labs/text-fabric/wiki/Api#locality). We then use the list of subphrase node numbers to create a list of all subphrase relations containing the word. If the list contains *any* dependent relations, then the word is automatically excluded from being a head word and we can move on to the next candidate. One final check is required for candidate words at the level of the `phrase`: the same procedure described above for `subphrase`s must be performed for `phrase_atom` relations. This means excluding words within a `phrase_atom` with a dependent relation to another `phrase_atom` within the `phrase`. If the head of a *`phrase_atom`* is being calculated, this step is not necessary.\n",
    "\n",
    "There are only two possible `subphrase` or `phrase_atom` relations for a valid head word: `NA` or `par`/`Para` (the verb is an exception, which in a handful of cases does have a construct relation). `NA` means that no relation is reflected. The word is independent. The `par` (`subphrase`) and `Para` (`phrase_atom`) stands for parallel relations, i.e. coordinates. While coordinates are not formally the head, they are often an important part of how the grammatical and semantic relations are built. Thus we provide coordinates alongside the head noun. These words require one further test, that is, it must be verified that their mother (using the [edge feature](https://github.com/Dans-labs/text-fabric/wiki/Api#edge-features) \"[mother](https://etcbc.github.io/bhsa/features/hebrew/c/mother.html)\") is itself a head word. To do this step thus requires us to keep track of those words within the phrase which have been validated. We can do so with a simple list.\n",
    "\n",
    "## Results\n",
    "The function `get_heads` produces head word nodes on supplied phrase(atom) nodes. The results have been manually inspected for consistency.\n",
    "\n",
    "For phrase types other than the noun phrase, the results are very accurate. Some phrase types, like the conjunction phrase, do have unexpected forms. For instance, the phrase בעבור is coded as a conjunction phrase in the BHSA; in it, there is actually no word with a part of speech of \"conjunction\". These kinds of cases are easily accounted for by making exceptions in the set of acceptable parts of speech.\n",
    "\n",
    "For noun phrases, the situation is different. In the majority of cases, the results are good. But there are a handful of cases that simply cannot be addressed using the current ETCBC data model without a solution that exceeds the bounds of this current project. The reason is that the current model does not transparently encode hierarchy between phrases and embeded phrases. For instance, both phrase atoms and subphrases have *some* overlapping features. But what is the relationship of a phrase atom to a subphrase? Or, what is the relation of one subphrase to another? These are only coded implicitly in the data. In reality, there are \"subphrases\" embedded within the ETCBC's subphrases which are not even registered in the BHSA data. While phrase atoms receive type codes, subphrases do not. Yet, subphrases are \"phrases\" too, which should also have type codes. Another problem is that the precise level of embedding for the subphrases are not provided. Subphrases are presented as equal constituents, even though some subphrases are contained within others. These kinds of problems make a simple method, such as applied here, inadequate. But more importantly, they highlight the shortcomings of the ETCBC data model.\n",
    "\n",
    "The members of the ETCBC are aware of the inadequacy of that data model to represent complex phrases, and a change is in the pipeline to address it. However, it remains to be seen how long those changes might take. For now, the functions produced and modified in this NB will sufice to provide a temporary solution for those who require head words from BHSA phrases."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Code Development\n",
    "\n",
    "Below we experiment with the code and develop the functions that will extract the head nouns. This involves a good deal of manual inspections of the results before exporting the Text-Fabric features.\n",
    "\n",
    "The code is written immediately below. Associated questions that arise while writing or evaluating the code are contained in the subsequent section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Documentation:** <a target=\"_blank\" href=\"https://etcbc.github.io/bhsa\" title=\"{provenance of this corpus}\">BHSA</a> <a target=\"_blank\" href=\"https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html\" title=\"{CORPUS} feature documentation\">Feature docs</a> <a target=\"_blank\" href=\"https://github.com/Dans-labs/text-fabric/wiki/Bhsa\" title=\"BHSA API documentation\">BHSA API</a> <a target=\"_blank\" href=\"https://github.com/Dans-labs/text-fabric/wiki/api\" title=\"text-fabric-api\">Text-Fabric API</a>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "\n",
       "This notebook online:\n",
       "<a target=\"_blank\" href=\"http://nbviewer.jupyter.org/github/etcbc/lingo/blob/master/heads/getting_heads.ipynb\">NBViewer</a>\n",
       "<a target=\"_blank\" href=\"https://github.com/etcbc/lingo/blob/master/heads/getting_heads.ipynb\">GitHub</a>\n"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "<style>\n",
       ".verse {\n",
       "    display: flex;\n",
       "    flex-flow: row wrap;\n",
       "    direction: rtl;\n",
       "}\n",
       ".vl {\n",
       "    display: flex;\n",
       "    flex-flow: column nowrap;\n",
       "    direction: ltr;\n",
       "}\n",
       ".sentence,.clause,.phrase {\n",
       "    margin-top: -1.2em;\n",
       "    margin-left: 1em;\n",
       "    background: #ffffff none repeat scroll 0 0;\n",
       "    padding: 0 0.3em;\n",
       "    border-style: solid;\n",
       "    border-radius: 0.2em;\n",
       "    font-size: small;\n",
       "    display: block;\n",
       "    width: fit-content;\n",
       "    max-width: fit-content;\n",
       "    direction: ltr;\n",
       "}\n",
       ".atoms {\n",
       "    display: flex;\n",
       "    flex-flow: row wrap;\n",
       "    margin: 0.3em;\n",
       "    padding: 0.3em;\n",
       "    direction: rtl;\n",
       "    background-color: #ffffff;\n",
       "}\n",
       ".satom,.catom,.patom {\n",
       "    margin: 0.3em;\n",
       "    padding: 0.3em;\n",
       "    border-radius: 0.3em;\n",
       "    border-style: solid;\n",
       "    display: flex;\n",
       "    flex-flow: column nowrap;\n",
       "    direction: rtl;\n",
       "    background-color: #ffffff;\n",
       "}\n",
       ".sentence {\n",
       "    border-color: #aa3333;\n",
       "    border-width: 1px;\n",
       "}\n",
       ".clause {\n",
       "    border-color: #aaaa33;\n",
       "    border-width: 1px;\n",
       "}\n",
       ".phrase {\n",
       "    border-color: #33aaaa;\n",
       "    border-width: 1px;\n",
       "}\n",
       ".satom {\n",
       "    border-color: #aa3333;\n",
       "    border-width: 4px;\n",
       "}\n",
       ".catom {\n",
       "    border-color: #aaaa33;\n",
       "    border-width: 3px;\n",
       "}\n",
       ".patom {\n",
       "    border-color: #33aaaa;\n",
       "    border-width: 3px;\n",
       "}\n",
       ".word {\n",
       "    padding: 0.1em;\n",
       "    margin: 0.1em;\n",
       "    border-radius: 0.1em;\n",
       "    border: 1px solid #cccccc;\n",
       "    display: flex;\n",
       "    flex-flow: column nowrap;\n",
       "    direction: rtl;\n",
       "    background-color: #ffffff;\n",
       "}\n",
       ".satom.l,.catom.l,.patom.l {\n",
       "    border-left-style: dotted\n",
       "}\n",
       ".satom.r,.catom.r,.patom.r {\n",
       "    border-right-style: dotted\n",
       "}\n",
       ".satom.L,.catom.L,.patom.L {\n",
       "    border-left-style: none\n",
       "}\n",
       ".satom.R,.catom.R,.patom.R {\n",
       "    border-right-style: none\n",
       "}\n",
       ".h {\n",
       "    font-family: \"Ezra SIL\", \"SBL Hebrew\", sans-serif;\n",
       "    font-size: large;\n",
       "    direction: rtl;\n",
       "}\n",
       ".rela,.function,.typ {\n",
       "    font-family: monospace;\n",
       "    font-size: small;\n",
       "    color: #0000bb;\n",
       "}\n",
       ".sp {\n",
       "    font-family: monospace;\n",
       "    font-size: medium;\n",
       "    color: #0000bb;\n",
       "}\n",
       ".gl {\n",
       "    font-family: sans-serif;\n",
       "    font-size: small;\n",
       "    color: #aaaaaa;\n",
       "}\n",
       ".vs {\n",
       "    font-family: sans-serif;\n",
       "    font-size: small;\n",
       "    font-weight: bold;\n",
       "    color: #444444;\n",
       "}\n",
       ".nd {\n",
       "    font-family: monospace;\n",
       "    font-size: x-small;\n",
       "    color: #999999;\n",
       "}\n",
       ".hl {\n",
       "    background-color: #ffee66;\n",
       "}\n",
       "</style>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import collections, os, sys, random\n",
    "from pprint import pprint\n",
    "from tf.fabric import Fabric\n",
    "from tf.extra.bhsa import Bhsa\n",
    "\n",
    "# load Text-Fabric and data\n",
    "data_loc = ['~/github/etcbc/bhsa/tf/c']\n",
    "TF = Fabric(locations=data_loc, silent=True)\n",
    "api = TF.load('''\n",
    "              book chapter verse\n",
    "              typ pdp rela mother \n",
    "              function lex sp ls\n",
    "              ''', silent=True)\n",
    "\n",
    "F, E, T, L = api.F, api.E, api.T, api.L # TF data methods\n",
    "B = Bhsa(api, name='getting_heads', version='c') # BHSA visualizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_heads(phrase):\n",
    "    '''\n",
    "    Extracts and returns the heads of a supplied\n",
    "    phrase or phrase atom based on that phrase's type\n",
    "    and the relations reflected within the phrase.\n",
    "    \n",
    "    --input--\n",
    "    phrase(atom) node number\n",
    "    \n",
    "    --output--\n",
    "    tuple of head word node(s) \n",
    "    '''\n",
    "    \n",
    "    # mapping from phrase type to good part of speech values for heads\n",
    "    head_pdps = {'VP': {'verb'},                   # verb \n",
    "                 'NP': {'subs', 'adjv', 'nmpr'},   # noun \n",
    "                 'PrNP': {'nmpr', 'subs'},         # proper-noun \n",
    "                 'AdvP': {'advb', 'nmpr', 'subs'}, # adverbial \n",
    "                 'PP': {'prep'},                   # prepositional \n",
    "                 'CP': {'conj', 'prep'},           # conjunctive\n",
    "                 'PPrP': {'prps'},                 # personal pronoun\n",
    "                 'DPrP': {'prde'},                 # demonstrative pronoun\n",
    "                 'IPrP': {'prin'},                 # interrogative pronoun\n",
    "                 'InjP': {'intj'},                 # interjectional\n",
    "                 'NegP': {'nega'},                 # negative\n",
    "                 'InrP': {'inrg'},                 # interrogative\n",
    "                 'AdjP': {'adjv'}                  # adjective\n",
    "                } \n",
    "    \n",
    "    # get phrase-head's part of speech value and list of candidate matches\n",
    "    phrase_type = F.typ.v(phrase)\n",
    "    head_candidates = [w for w in L.d(phrase, 'word')\n",
    "                          if F.pdp.v(w) in head_pdps[phrase_type]]\n",
    "        \n",
    "    # VP with verbs require no further processing, return the head verb\n",
    "    if phrase_type == 'VP':        \n",
    "        return tuple(head_candidates)\n",
    "        \n",
    "    # go head-hunting!\n",
    "    heads = []\n",
    "    \n",
    "    for word in head_candidates:\n",
    "        \n",
    "        # gather the word's subphrase (+ phrase_atom if otype is phrase) relations\n",
    "        word_phrases = list(L.u(word, 'subphrase'))\n",
    "        word_phrases += list(L.u(word, 'phrase_atom')) if (F.otype.v(phrase) == 'phrase') else list()\n",
    "        word_relas = set(F.rela.v(phr) for phr in word_phrases) or {'NA'}\n",
    "\n",
    "        # check (sub)phrase relations for independency\n",
    "        if word_relas - {'NA', 'par', 'Para'}: \n",
    "            continue\n",
    "            \n",
    "        # check parallel relations for independency\n",
    "        elif word_relas & {'par', 'Para'} and mother_is_head(word_phrases, heads):\n",
    "            this_head = find_quantified(word) or find_attributed(word) or word\n",
    "            heads.append(this_head)\n",
    "            \n",
    "        # save all others as heads, check for quantifiers first\n",
    "        elif word_relas == {'NA'}:\n",
    "            this_head = find_quantified(word) or find_attributed(word) or word\n",
    "            heads.append(this_head)\n",
    "            \n",
    "    return tuple(sorted(set(heads)))\n",
    "            \n",
    "def mother_is_head(word_phrases, previous_heads):\n",
    "    \n",
    "    '''\n",
    "    Test and validate parallel relationships for independency.\n",
    "    Must gather the mother for each relation and check whether \n",
    "    the mother contains a head word. \n",
    "    \n",
    "    --input--\n",
    "    * list of phrase nodes for a given word (includes subphrases)\n",
    "    * list of previously approved heads\n",
    "    \n",
    "    --output--\n",
    "    boolean\n",
    "    '''\n",
    "    \n",
    "    # get word's enclosing phrases that are parallel\n",
    "    parallel_phrases = [ph for ph in word_phrases if F.rela.v(ph) in {'par', 'Para'}]\n",
    "    # get the mother for the parallel phrases\n",
    "    parallel_mothers = [E.mother.f(ph)[0] for ph in parallel_phrases] \n",
    "    # get mothers' words, by mother\n",
    "    parallel_mom_words = [set(L.d(mom, 'word')) for mom in parallel_mothers]\n",
    "    # test for head in each mother\n",
    "    test_mothers = [bool(phrs_words & set(previous_heads)) for phrs_words in parallel_mom_words] \n",
    "        \n",
    "    return all(test_mothers)\n",
    "    \n",
    "\n",
    "def find_quantified(word):\n",
    "    \n",
    "    '''        \n",
    "    Check whether a head candidate is a quantifier (e.g. כל).\n",
    "    If it is, find the quantified noun if there is one.\n",
    "    Quantifiers are connected with the modified noun\n",
    "    either by a subphrase relation of \"rec\" for nomen \n",
    "    regens. In this case, the quantifier word node is the\n",
    "    mother itself. In other cases, the noun is related to the\n",
    "    number via the \"atr\" (attributive) subphrase relation. In this\n",
    "    case, the edge relation is connected from the substantive\n",
    "    to the number's subphrase.\n",
    "    \n",
    "    --input--\n",
    "    word node\n",
    "    \n",
    "    --output--\n",
    "    new word node or None\n",
    "    '''\n",
    "    \n",
    "    custom_quants = {'KL/', 'M<V/', 'JTR/', # quantifier lexemes, others?\n",
    "                     'M<FR/', 'XYJ/'} \n",
    "    good_pdps = {'subs', # substantive\n",
    "                 'nmpr', # proper noun\n",
    "                 'prde', # demonstrative\n",
    "                 'prps', # pronoun\n",
    "                 'verb'} # \"verb\" for participles, see the inquiries below.\n",
    "    \n",
    "    if F.lex.v(word) not in custom_quants and F.ls.v(word) not in {'card', 'ordn'}:        \n",
    "        return None\n",
    "    \n",
    "    # first check rec relations for valid quantified noun:\n",
    "    rectum = next((sp for sp in E.mother.t(word) if F.rela.v(sp) == 'rec'), 0) # extract the rectum\n",
    "    noun = next((w for w in L.d(rectum, 'word') if F.pdp.v(w) in good_pdps), 0) # filter words for noun\n",
    "    num_check = F.ls.v(L.u(noun, 'lex')[0]) if noun else ''\n",
    "    if noun and num_check not in {'card', 'ordn'}:\n",
    "        return noun\n",
    "    \n",
    "    # check the adjunct relation if no rec found:\n",
    "    subphrases = sorted(L.u(word, 'subphrase'))    \n",
    "    # move progressively from smallest to largest subphrase, stop when non-cardinal noun is found  \n",
    "    for sp in subphrases:        \n",
    "        candidates = sorted(daughter for daughter in E.mother.t(sp) if F.rela.v(daughter) == 'adj')\n",
    "        for candi in candidates:\n",
    "            noun = next((w for w in L.d(candi, 'word') if F.pdp.v(w) in good_pdps), 0)\n",
    "            num_check = F.ls.v(L.u(noun, 'lex')[0]) if noun else ''\n",
    "            if noun and num_check not in {'card', 'ordn'}:                \n",
    "                return noun\n",
    "    \n",
    "    # all else are non-quantifiers\n",
    "    return None\n",
    "\n",
    "def find_attributed(word):\n",
    "    \n",
    "    '''        \n",
    "    Check whether the head candidate is an adjective.\n",
    "    If it is, retrieve its attributed noun via the\n",
    "    regens (rec) relationship.\n",
    "    \n",
    "    This function is similar to the quantified function.\n",
    "    \n",
    "    --input--\n",
    "    word node\n",
    "    \n",
    "    --output--\n",
    "    new word node or None\n",
    "    '''\n",
    "    \n",
    "    if F.pdp.v(word) != 'adjv':\n",
    "        return None\n",
    "    \n",
    "    # check rec relations for valid attributed noun:\n",
    "    rectum = next((sp for sp in E.mother.t(word) if F.rela.v(sp) == 'rec'), 0) # extract the rectum\n",
    "    noun = next((w for w in L.d(rectum, 'word') if F.pdp.v(w) == 'subs'), 0) # filter words for noun\n",
    "    if noun:\n",
    "        return noun\n",
    "    \n",
    "    # sanity check: adjectives should not \n",
    "    # pass through this algorithm without a noun assignment\n",
    "    if F.typ.v(L.u(word, 'phrase')) == 'NP':\n",
    "        raise Exception(f'adjective head assignment on NP {L.u(word, \"phrase\")} at word {word}')\n",
    "    else:\n",
    "        return None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tests\n",
    "\n",
    "Testing the get heads function and measuring the results.\n",
    "\n",
    "### Test Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_subphrases(phrase_node):\n",
    "    '''\n",
    "    Inspect subphrases and their relations to each other.\n",
    "    '''\n",
    "    print('subphrase', '\\ttext', '\\trelation', '\\tmother')\n",
    "    for sp in L.d(phrase_node, 'subphrase'):\n",
    "        print(T.text(L.d(sp, 'word')), sp, F.rela.v(sp), E.mother.f(sp))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Simple Test\n",
    "Apply the function to all the phrases in the HB. Record those cases which fail to receive a head assignment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "headless = []\n",
    "total = 0\n",
    "\n",
    "for phrase in F.otype.s('phrase'):\n",
    "    \n",
    "    total += 1\n",
    "    \n",
    "    heads = get_heads(phrase)\n",
    "    \n",
    "    if not heads:\n",
    "        headless.append((phrase,))\n",
    "        \n",
    "len(headless)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Quality Tests\n",
    "\n",
    "What do the results look like? Start with noun-phrases, but retrieve interesting examples with more than a few words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2668"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "examples = []\n",
    "\n",
    "for phrase in F.typ.s('NP'):\n",
    "    \n",
    "    len_words = len(L.d(phrase, 'word'))\n",
    "    \n",
    "    if len_words > 5:\n",
    "    \n",
    "        heads = get_heads(phrase)\n",
    "    \n",
    "        examples.append(heads)\n",
    "    \n",
    "len(examples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "random.shuffle(examples) # get samples at random"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(examples[:20]) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Discovery\n",
    "\n",
    "The queries which follow were written at different times during the code construction for the heads algorithm.\n",
    "\n",
    "In this section, important questions were asked whose answers are needed to ensure the code is written correctly. The BHSA data is queried to answer them. These are questions like, \"Do we need to check for relational independency for only noun phrases?\" (no); and \"does every phrase type have a word with a corresponding pdp?\" (no).\n",
    "\n",
    "### Make definitions available for exploration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "# mapping from phrase type to its head part of speech\n",
    "type_to_pdp = {'VP': 'verb', # verb \n",
    "               'NP': 'subs', # noun \n",
    "               'PrNP': 'nmpr', # proper-noun \n",
    "               'AdvP': 'advb', # adverbial \n",
    "               'PP': 'prep', # prepositional \n",
    "               'CP': 'conj', # conjunctive\n",
    "               'PPrP': 'prps', # personal pronoun\n",
    "               'DPrP': 'prde', # demonstrative pronoun\n",
    "               'IPrP': 'prin', # interrogative pronoun\n",
    "               'InjP': 'intj', # interjectional\n",
    "               'NegP': 'nega', # negative\n",
    "               'InrP': 'inrg', # interrogative\n",
    "               'AdjP': 'adjv'} # adjective"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test for non-NP phrases with valid pdp but invalid head\n",
    "\n",
    "These tests demonstrate that subphrase relation checks are also needed for phrase types besides noun phrases. The only valid subphrase/phrase_atom relations for any potential head word is either `NA` or `par`/`Para`. While a few phrase types do not need additional relational checks, e.g. personal pronoun phrases, we can go ahead and consistently handle all phrases in the same way.\n",
    "\n",
    "The only exception to the above rule is the `VP`, for which there are 14 cases of the `VP`'s head word (verb) that is also in a subphrase with a regens (`rec`) relation.\n",
    "\n",
    "The operational question of these tests was:\n",
    "> Are there cases in which a non-NP phrase(atom) contains a word with the corresponding pdp value, but which is probably not a head?\n",
    "\n",
    "To answer the question, we first survey all cases where the phrase type's head candidate is in a subphrase with a relation that is not normally \"independent.\" Based on the survey, we manually check the most pertinent phrase types and results. The tests reveal that, indeed, relation checks are needed for many phrase types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_pdp_safe(phrase_object='phrase_atom'):\n",
    "    \n",
    "    '''\n",
    "    Make a survey of phrase types and their matching pdp words,\n",
    "    count what kinds of subphrase relations these words \n",
    "    occurr in. The survey can then be used to investigate \n",
    "    whether phrase types besides noun phrases require relationship\n",
    "    checks for independency.\n",
    "    '''\n",
    "    \n",
    "    pdp_relas_survey = collections.defaultdict(lambda: collections.Counter())\n",
    "    headless = 0\n",
    "    \n",
    "    for phrase in F.otype.s(phrase_object):\n",
    "\n",
    "        typ = F.typ.v(phrase) # phrase type\n",
    "\n",
    "        # skip noun phrases\n",
    "        if typ in {'NP', 'PrNP'}: \n",
    "            continue\n",
    "\n",
    "        head_pdp = type_to_pdp[typ]\n",
    "\n",
    "        maybe_heads = [w for w in L.d(phrase, 'word') \n",
    "                           if F.pdp.v(w) == head_pdp]\n",
    "        \n",
    "        # this check shows that many\n",
    "        # phrases don't have a word \n",
    "        # with a corresponding pdp!\n",
    "        if not maybe_heads:\n",
    "            headless += 1\n",
    "\n",
    "        # survey the candidate heads' relations\n",
    "        for word in maybe_heads:\n",
    "\n",
    "            head_name = typ + '|' + head_pdp\n",
    "            subphrases = L.u(word, 'subphrase')\n",
    "            sp_relas = set(F.rela.v(sp) for sp in subphrases)\\\n",
    "                        if subphrases else {'NA'} # <- handle cases without any subphrases (i.e. verbs)\n",
    "\n",
    "            pdp_relas_survey[head_name].update(sp_relas)\n",
    "\n",
    "    print(f'phrases without matching pdp: {headless}\\n')\n",
    "    print('subphrase relation survey: ')\n",
    "    for name, rela_counts in pdp_relas_survey.items():\n",
    "\n",
    "        print(name)\n",
    "\n",
    "        for r, count in rela_counts.items():\n",
    "            print('\\t', r, '-', count)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "phrases without matching pdp: 837\n",
      "\n",
      "subphrase relation survey: \n",
      "PP|prep\n",
      "\t NA - 64521\n",
      "\t par - 3824\n",
      "\t adj - 42\n",
      "\t rec - 8\n",
      "VP|verb\n",
      "\t NA - 69011\n",
      "\t rec - 14\n",
      "\t par - 1\n",
      "CP|conj\n",
      "\t NA - 53859\n",
      "AdvP|advb\n",
      "\t NA - 5131\n",
      "\t par - 102\n",
      "\t mod - 49\n",
      "\t adj - 1\n",
      "AdjP|adjv\n",
      "\t NA - 1845\n",
      "\t par - 135\n",
      "\t atr - 5\n",
      "\t adj - 3\n",
      "\t rec - 1\n",
      "InjP|intj\n",
      "\t NA - 1872\n",
      "\t par - 11\n",
      "DPrP|prde\n",
      "\t NA - 790\n",
      "NegP|nega\n",
      "\t NA - 6743\n",
      "PPrP|prps\n",
      "\t NA - 4468\n",
      "\t par - 9\n",
      "IPrP|prin\n",
      "\t NA - 797\n",
      "\t par - 1\n",
      "InrP|inrg\n",
      "\t NA - 1288\n",
      "\t par - 3\n"
     ]
    }
   ],
   "source": [
    "# for phrase_atoms\n",
    "test_pdp_safe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "phrases without matching pdp: 670\n",
      "\n",
      "subphrase relation survey: \n",
      "PP|prep\n",
      "\t NA - 62315\n",
      "\t par - 3678\n",
      "\t adj - 42\n",
      "\t rec - 9\n",
      "VP|verb\n",
      "\t NA - 69011\n",
      "\t rec - 14\n",
      "\t par - 1\n",
      "CP|conj\n",
      "\t NA - 52545\n",
      "AdvP|advb\n",
      "\t NA - 5083\n",
      "\t par - 101\n",
      "\t mod - 46\n",
      "\t adj - 1\n",
      "AdjP|adjv\n",
      "\t NA - 1797\n",
      "\t par - 118\n",
      "\t atr - 5\n",
      "\t adj - 3\n",
      "\t rec - 1\n",
      "InjP|intj\n",
      "\t NA - 1872\n",
      "\t par - 11\n",
      "DPrP|prde\n",
      "\t NA - 791\n",
      "NegP|nega\n",
      "\t NA - 6743\n",
      "PPrP|prps\n",
      "\t NA - 4388\n",
      "\t par - 9\n",
      "IPrP|prin\n",
      "\t NA - 797\n",
      "\t par - 1\n",
      "InrP|inrg\n",
      "\t NA - 1288\n",
      "\t par - 3\n"
     ]
    }
   ],
   "source": [
    "# and for phrases\n",
    "test_pdp_safe(phrase_object='phrase')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "^ These surveys tell us that for several of these phrase types, e.g. `InjP`, we can automatically take the word with the pdp value that corresponds with its phrase type as the head.\n",
    "\n",
    "There are also quite a few cases where the phrase type does not have a word with a matching pdp value: 837 for phrase atoms and 670 for phrases. In the subsequent section we will run tests to find out why this is the case.\n",
    "\n",
    "Back to the question of this section: There are 14 examples of VP with verbs that have a `rec` (nomen regens) relation. Are these heads or not? We check now..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_and_show(search_pattern):\n",
    "    results = sorted(B.search(search_pattern))\n",
    "    print(len(results), 'results')\n",
    "    B.show(results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "# run notebook locally to see HTML-formatted results for the below searches\n",
    "\n",
    "\n",
    "rec_verbs = '''\n",
    "\n",
    "phrase_atom typ=VP\n",
    "    subphrase rela=rec\n",
    "        word pdp=verb\n",
    "'''\n",
    "\n",
    "#find_and_show(rec_verbs) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In all 14 results, the verb serves as the true head word of the `VP`.\n",
    "\n",
    "*Note: The verb will prove to be an exception, as all other words in a `rec` relation are not head words*\n",
    "\n",
    "The `PP` also has some strange relations. We see what's going on with the same kind of inspection. First we look at the `rec` (regens) relations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "rec_preps = '''\n",
    "\n",
    "phrase_atom typ=PP\n",
    "    subphrase rela=rec\n",
    "        word pdp=prep\n",
    "'''\n",
    "\n",
    "#find_and_show(rec_preps) #uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The PP is different. In cases where the subphrase relation is `rec`, the preposition is *not* the head. Thus, the algorithm will need to check for these cases.\n",
    "\n",
    "Now for the `adj` subphrase relation in `PP`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "adj_preps = '''\n",
    "\n",
    "phrase_atom typ=PP\n",
    "    subphrase rela=adj\n",
    "        word pdp=prep\n",
    "'''\n",
    "\n",
    "#find_and_show(adj_preps) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results above show that the `adj` subphrase relation is also a non-head. These cases have to be excluded.\n",
    "\n",
    "Now we move on to test the **adverb** relations reflected in the survey..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "adv_adj = '''\n",
    "\n",
    "phrase_atom typ=AdvP\n",
    "    subphrase rela=adj\n",
    "        word pdp=advb\n",
    "\n",
    "'''\n",
    "\n",
    "#find_and_show(adv_adj) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An adverb in an `adj` relation within the adverbial phrase is also not a true head. Now for the `mod` (modifier) relation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "adv_mod = '''\n",
    "\n",
    "phrase_atom typ=AdvP\n",
    "    subphrase rela=mod\n",
    "        word pdp=advb\n",
    "\n",
    "'''\n",
    "\n",
    "#find_and_show(adv_mod) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, it appears that `mod` is also an invalid relation for heads of adverb phrases. An example is גם הלם ('also here'), where גם is the adverb in the `mod` relation, but the head is really הלם \"here\" (also an adverb). In several cases, the modifier modifies a verb. In these cases the \"head,\" often a participle or infinitive, acts as the adverb, even though it is not explicitly marked as such.\n",
    "\n",
    "Now we move on to the last examination, that of the `AdjP` (adjective phrase). There are three relations of interest:\n",
    "> atr - 5 <br>\n",
    "> adj - 3 <br>\n",
    "> rec - 1 <br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "adj_atr = '''\n",
    "\n",
    "phrase_atom typ=AdjP\n",
    "    subphrase rela=atr\n",
    "        word pdp=adjv\n",
    "\n",
    "'''\n",
    "\n",
    "#find_and_show(adj_atr) # uncomment me!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "adj_adj = '''\n",
    "\n",
    "phrase_atom typ=AdjP\n",
    "    subphrase rela=adj\n",
    "        word pdp=adjv\n",
    "\n",
    "'''\n",
    "\n",
    "#find_and_show(adj_adj) # uncomment me!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "adj_rec = '''\n",
    "\n",
    "phrase_atom typ=AdjP\n",
    "    subphrase rela=rec\n",
    "        word pdp=adjv\n",
    "\n",
    "'''\n",
    "\n",
    "#find_and_show(adj_rec) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results of the three searches above confirm that adjectives in `atr`, `adj`, or `rec` relations are not head words."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tests for phrase types without a word that has a valid pdp value\n",
    "\n",
    "The initial survey above revealed that 837 phrase atoms and 670 phrases lack a word with a corresponding pdp value. Here we investigate why that is the case. Is there a way to compensate for this problem? Are these truly phrases that lack heads?\n",
    "\n",
    "We run another survey and count the phrase types against the non-matching pdp values found within them. At this point, we must also exclude words that have dependent subphrase relations. As established above, only words with no subphrase relation (`NA`) or with a parallel (`par`) relation to an independent word are kept as candidate heads."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AdvP\n",
      "\t nmpr - 253\n",
      "\t subs - 499\n",
      "\t art - 190\n",
      "\t conj - 13\n",
      "PrNP\n",
      "\t subs - 9\n",
      "\t art - 3\n",
      "CP\n",
      "\t prep - 85\n",
      "\t subs - 79\n",
      "\t advb - 6\n",
      "NP\n",
      "\t intj - 1\n"
     ]
    }
   ],
   "source": [
    "count_no_pdp = collections.defaultdict(collections.Counter)\n",
    "record_no_pdp = collections.defaultdict(lambda: collections.defaultdict(list))\n",
    "\n",
    "for phrase in F.otype.s('phrase_atom'):\n",
    "    \n",
    "    typ = F.typ.v(phrase)\n",
    "    \n",
    "    # check whether the phrase atom lacks a word with the corresponding pdp value\n",
    "    corres_pdp = type_to_pdp[typ]\n",
    "    corresponding_pdps = [w for w in L.d(phrase, 'word') \n",
    "                             if F.pdp.v(w) == corres_pdp]\n",
    "    \n",
    "    if not corresponding_pdps:\n",
    "        \n",
    "        # put potential heads here\n",
    "        maybe_heads = []\n",
    "        \n",
    "        # calculate subphrase relations\n",
    "        for word in L.d(phrase, 'word'):\n",
    "            \n",
    "            # get subphrase relations\n",
    "            word_subphrs = L.u(word, 'subphrase')\n",
    "            sp_relas = set(F.rela.v(sp) for sp in word_subphrs) or {'NA'}\n",
    "            \n",
    "            # check subphrase relations for independence\n",
    "            if sp_relas == {'NA'}:\n",
    "                maybe_heads.append(word)\n",
    "                \n",
    "            # test parallel relation for independence\n",
    "            elif sp_relas == {'NA', 'par'} or sp_relas == {'par'}:\n",
    "                \n",
    "                # check for good, head mothers\n",
    "                good_mothers = set(sp for w in maybe_heads for sp in L.u(w, 'subphrase'))\n",
    "                this_daughter = [sp for sp in word_subphrs if F.rela.v(sp) == 'par'][0]\n",
    "                this_mother = E.mother.f(this_daughter)\n",
    "                \n",
    "                if this_mother in good_mothers:\n",
    "                    maybe_heads.append(word)\n",
    "                    \n",
    "        # sanity check\n",
    "        # maybe_heads should have SOMETHING\n",
    "        if not maybe_heads:\n",
    "            raise Exception(f'phrase {phrase} looks HEADLESS!')\n",
    "        \n",
    "        # count pdp types\n",
    "        head_pdps = [F.pdp.v(w) for w in maybe_heads]\n",
    "        count_no_pdp[typ].update(head_pdps)\n",
    "        \n",
    "        # save for examination\n",
    "        for word in maybe_heads:\n",
    "            record_no_pdp[typ][F.pdp.v(word)].append((phrase, word))\n",
    "        \n",
    "for name, counts in count_no_pdp.items():\n",
    "\n",
    "    print(name)\n",
    "\n",
    "    for pdp, count in counts.items():\n",
    "        print('\\t', pdp, '-', count)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These results are a bit puzzling. The numbers here count words within the phrase atoms that have NO subphrase relations. That means, for example, that words such as הַ \"the\" do not appear to have any subphrase relation to the nouns they modify. That again illustrates the shortcoming of the ETCBC data in this respect. There should be a relation from the article to the determined noun.\n",
    "\n",
    "From this point forward, I will begin working through all four phrase types and the cases reflected in the survey.\n",
    "\n",
    "We begin with the `AdvP` type and the article. Upon some initial inspection, I've found that in many of the `AdvP` with the article, there is also a substantive (`subs`) that was found by the search. Are there any cases where there is no `nmpr` or `subs` found alongside the article? We can use the dict `record_no_pdp`, which has recorded all cases reflected in the survey. Below I check whether all 190 cases of an article in these `AdvP` phrases also have a corresponding noun."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 without nouns found...\n"
     ]
    }
   ],
   "source": [
    "no_noun = []\n",
    "\n",
    "# each record is a (phrase_atom, word) tuple; the word here is the article\n",
    "for phrase_atom, article in record_no_pdp['AdvP']['art']:\n",
    "    \n",
    "    pdps = set(F.pdp.v(w) for w in L.d(phrase_atom, 'word'))\n",
    "    \n",
    "    if not {'nmpr', 'subs'} & pdps:\n",
    "        no_noun.append((phrase_atom, article))\n",
    "        \n",
    "print(len(no_noun), 'without nouns found...')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There it is. So all cases of these articles can be discarded. In these cases, the noun serves as the head of the adverbial phrase. An example of this is when the noun marks the location of the action (hence adverb). \n",
    "\n",
    "Next, we check the conjunctions found in the adverbial phrases. Are any of those heads?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(record_no_pdp['AdvP']['conj']) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All conjunctions in these `AdvP` phrases function to mark coordinate elements (only ו in these results). They can also be discarded as possible heads.\n",
    "\n",
    "Now we investigate the `PrNP` results with `subs` and `art`..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(record_no_pdp['PrNP']['subs']) # uncomment me!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(record_no_pdp['PrNP']['art']) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `art` words reflected in the second search are not heads, but are all related to a substantive. All of the results in `subs` are heads. Thus, the only acceptable pdp for `PrNP` besides a proper noun is `subs`.\n",
    "\n",
    "Now we dig into `CP` results. 85 of them have no `pdp` of conjunction, but have a preposition instead. Let's see what's going on..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(record_no_pdp['CP']['prep'][:20]) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are very interesting results. These conjunction phrases are made up of constructions like ב+עבור and ב+טרם. Together these words function as a conjunction, but alone they are prepositions and particles. Is it even possible in this case to say that there is a \"head\"?\n",
    "\n",
    "It could be said that these combinations of words mean more than the sum of their parts; they are good examples of constructions, i.e. combinations of words whose meaning cannot be inferred simply from their individual words. Constructions illustrate the vague boundary between syntax and lexicon (cf. e.g. Goldberg, 1995, *Constructions*).\n",
    "\n",
    "While these words are indeed marked as conjunction phrases, it is better in this case to analyze them as prepositional phrases (which they also are...this is another shortcoming of our data, or perhaps a mistake??). Thus, the head is the preposition, not the prepositional object.\n",
    "\n",
    "We should expect that the remaining `subs` and `advb` groups are in fact the objects of those prepositions (and hence excluded). Let's test that assumption by looking for a preposition behind these words..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "subs|advb with no preceding prepositions: 0\n"
     ]
    }
   ],
   "source": [
    "no_prep = []\n",
    "\n",
    "for (phrase, word) in record_no_pdp['CP']['subs'] + record_no_pdp['CP']['advb']:\n",
    "    \n",
    "    possible_prep = word - 1\n",
    "    \n",
    "    if F.sp.v(possible_prep) != 'prep':\n",
    "        no_prep.append((phrase, word))\n",
    "        \n",
    "print(f'subs|advb with no preceding prepositions: {len(no_prep)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The test confirms that none of the substantives or adverbs will be the head of a conjunction phrase. A preposition is the only other kind of head for the `CP` besides a conjunction itself.\n",
    "\n",
    "Finally, we're left with one last noun phrase (`NP`) for which no matching noun was found. The search found an `intj` (interjection) instead. Let's see it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(record_no_pdp['NP']['intj']) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, the word אוי \"woe\" functions like a noun. This thus appears to be another mislabeled `pdp` value, since it should read `subs`. This, like the previous example, will not receive a head value due to the mistake."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retrieving Quantified Words\n",
    "\n",
    "When the heads algorithm looks for a noun without any subphrase relations in the phrase, it will often return a quantifier noun such as a number, e.g. שבעה \"seven\", or such as another descriptor like כל. But these words function semantically in a more descriptive role than a head role. Thus, we want our algorithm to isolate quantified nouns from their quantifiers. To do that means we must first know how the ETCBC encodes the relationship between a quantifier and the quantified noun. \n",
    "\n",
    "In a previous algorithm used for quantified extraction, we looked for a nomen regens relation on the quantifier and located the noun within the related subphrase. This approach works well for the quantifier כל. But for cardinal numbers, the relation `adj` (adjunct) is often used as well (as seen in the surveys below).\n",
    "\n",
    "To illustrate with the search below, the quantifier שבעה \"seven\" has no nomen regens relation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show([(2217,)]) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rather than reflecting a regens/rectum relation, the second word שנים \"years,\" the quantified noun, has a subphrase relation of `adj` \"adjunct\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "שֶׁ֣בַע שָׁנִ֔ים וּשְׁמֹנֶ֥ה מֵאֹ֖ות שָׁנָ֑ה \n",
      "\n",
      "1301096 (subphrase)\n",
      "\t שָׁנִ֔ים \n",
      "\t rela: adj\n",
      "\n",
      "1301097 (subphrase)\n",
      "\t שֶׁ֣בַע שָׁנִ֔ים \n",
      "\t rela: NA\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(T.text(L.d(652883, 'word')))\n",
    "print()\n",
    "\n",
    "for sp in L.u(2218, 'subphrase'): # subphrases belonging to \"years\"\n",
    "    print(sp, '(subphrase)')\n",
    "    print('\\t', T.text(L.d(sp,'word')))\n",
    "    print('\\t', 'rela:',  F.rela.v(sp))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what other kinds of subphrase relations are reflected by quantifieds.\n",
    "\n",
    "Below we make a survey of all mother-daughter relations between a quantifier subphrase and its daughters. The goal is to isolate those relationships which contain the quantified noun. We work through examples to get an idea of the meaning of the features. And we write a few TF search queries further below to confirm hypotheses about these relationships."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">XD/\n",
      "\t par - 62\n",
      "\t adj - 42\n",
      "\t rec - 69\n",
      "\t Appo - 1\n",
      "\t Spec - 6\n",
      "\t atr - 8\n",
      "\t mod - 4\n",
      "CNJM/\n",
      "\t rec - 345\n",
      "\t adj - 267\n",
      "\t par - 109\n",
      "\t mod - 14\n",
      "\t atr - 5\n",
      "\t Spec - 6\n",
      "\t dem - 3\n",
      "\t Sfxs - 1\n",
      ">RB</\n",
      "\t adj - 285\n",
      "\t par - 100\n",
      "\t rec - 113\n",
      "\t atr - 8\n",
      "\t Spec - 1\n",
      "\t mod - 1\n",
      "CB</\n",
      "\t par - 87\n",
      "\t adj - 268\n",
      "\t rec - 169\n",
      "\t mod - 1\n",
      "\t dem - 1\n",
      "\t atr - 23\n",
      "\t Spec - 4\n",
      "CLC/\n",
      "\t par - 148\n",
      "\t adj - 331\n",
      "\t rec - 175\n",
      "\t atr - 4\n",
      "\t dem - 1\n",
      "\t Spec - 1\n",
      "M>H/\n",
      "\t rec - 30\n",
      "\t adj - 302\n",
      "\t par - 217\n",
      "\t atr - 2\n",
      "CMNH/\n",
      "\t adj - 128\n",
      "\t par - 28\n",
      "\t rec - 8\n",
      "\t mod - 1\n",
      "TC</\n",
      "\t adj - 46\n",
      "\t par - 43\n",
      "\t rec - 22\n",
      "XMC/\n",
      "\t adj - 282\n",
      "\t par - 175\n",
      "\t rec - 96\n",
      "\t mod - 1\n",
      "\t Attr - 1\n",
      "\t atr - 4\n",
      "\t Spec - 2\n",
      "<FRH/\n",
      "\t adj - 96\n",
      "\t par - 16\n",
      "\t Spec - 1\n",
      "<FR=/\n",
      "\t adj - 11\n",
      "\t par - 6\n",
      "\t rec - 30\n",
      "CC/\n",
      "\t adj - 176\n",
      "\t par - 75\n",
      "\t rec - 103\n",
      "\t atr - 4\n",
      "<FRJM/\n",
      "\t adj - 209\n",
      "\t par - 149\n",
      "\t Spec - 1\n",
      "<FR/\n",
      "\t adj - 97\n",
      "\t par - 12\n",
      "\t mod - 2\n",
      "\t atr - 6\n",
      "\t rec - 2\n",
      "\t Spec - 2\n",
      "<FRH=/\n",
      "\t adj - 52\n",
      "\t rec - 49\n",
      "\t par - 10\n",
      "\t dem - 2\n",
      "\t Spec - 1\n",
      ">LP=/\n",
      "\t adj - 143\n",
      "\t rec - 33\n",
      "\t par - 198\n",
      "\t Spec - 1\n",
      "\t atr - 2\n",
      "<CTJ/\n",
      "\t adj - 13\n",
      "\t rec - 19\n",
      "\t par - 1\n",
      "RBW>/\n",
      "\t adj - 3\n",
      "\t par - 3\n",
      "XD/\n",
      "\t atr - 1\n",
      "\t Spec - 2\n",
      "CTJN/\n",
      "\t par - 1\n",
      "CT/\n",
      "TLT/\n",
      "\t adj - 2\n",
      "TRJN/\n",
      "\t rec - 2\n",
      ">LP/\n",
      "\t rec - 1\n",
      "<FRJN/\n",
      "TLTJN/\n"
     ]
    }
   ],
   "source": [
    "quant_relas = collections.defaultdict(collections.Counter)\n",
    "quant_ex = collections.defaultdict(lambda: collections.defaultdict(list))\n",
    "quants = [word for lex in F.ls.s('card') for word in L.d(lex, 'word')]\n",
    "\n",
    "for word in quants:\n",
    "    \n",
    "    # collect daughters of the quantifier's subphrases and of the quantifier word itself\n",
    "    subphrases = L.u(word, 'subphrase')\n",
    "    sp_daughters = [E.mother.t(sp) for sp in subphrases if E.mother.t(sp)]\n",
    "    sp_daughters += [E.mother.t(word)] if E.mother.t(word) else []\n",
    "    sp_relas = [F.rela.v(sp[0]) for sp in sp_daughters]\n",
    "    quant_relas[F.lex.v(word)].update(sp_relas)\n",
    "    \n",
    "    # save examples for inspection\n",
    "    for rela in sp_relas:\n",
    "        quant_ex[F.lex.v(word)][rela].append((L.u(word, 'phrase')[0], word))\n",
    "    \n",
    "for name, counts in quant_relas.items():\n",
    "\n",
    "    print(name)\n",
    "\n",
    "    for rela, count in counts.items():\n",
    "        print('\\t', rela, '-', count)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on the inspection below, it can be seen that quantified nouns are connected to their quantifier via a subphrase relation of either `adj` (adjunctive) or `rec` (regens), as mentioned at the beginning of this inquiry."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(quant_ex['CB</']['rec']) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The query below shows that the relation `par` most frequently refers to a parallel number, e.g. שבעים ושבעה \"seventy and seven\" where \"and seven\" is in a parallel relationship."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(quant_ex['CB</']['par']) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `atr` relation appears when an adjective is used to describe a quantifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(quant_ex['>XD/']['atr']) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Spec` relation (a phrase_atom rela) marks cases where a phrase atom is used to add adjectival information about the quantifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(quant_ex['>XD/']['Spec']) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `mod` relation marks cases where the quantifier is modified with particles like גם or רק."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(quant_ex['CNJM/']['mod']) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `dem` relation appears when a demonstrative like אלה modifies the quantifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(quant_ex['CB</']['dem']) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on the analysis up to this point, there are two kinds of relations which lead back to the quantified noun: the `rec` (regens) and `adj` (adjunct) relations. What about in cases where both of these relations are present? Is there ever a case where it is ambiguous which relation contains the quantified noun?\n",
    "\n",
    "We use a TF search pattern to build a query for these cases. We look for cases that have three subphrases. The first has a word (`w1`) which is also contained in a lex object (second to bottom block) that has a `ls` (lexical set) value of `card` (cardinal number). Then we look for two other subphrases that have a relation either to the first subphrase (in the case of the `adj` rela) or the quantifier word contained in the first subphrase (in the case of a regens relation). Within `sp2` and `sp3`, we also select the first word so we can highlight it in the `B.show` below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "245"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "quant_rec_adj = '''\n",
    "\n",
    "sp1:subphrase\n",
    "    w1:word\n",
    "\n",
    "sp2:subphrase rela=rec\n",
    "    =: word\n",
    "\n",
    "sp3:subphrase rela=adj\n",
    "    =: word\n",
    "\n",
    "lex ls=card\n",
    "   w2:word\n",
    "   \n",
    "w1 = w2\n",
    "w1 <mother- sp2\n",
    "sp1 <mother- sp3\n",
    "\n",
    "sp2 <: sp3\n",
    "'''\n",
    "\n",
    "quant_rec_adj = B.search(quant_rec_adj)\n",
    "\n",
    "len(quant_rec_adj)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(sorted(quant_rec_adj)) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 245 cases with both relations. Based on inspection, it seems that the word in the `rec` relation is usually another quantifier. Are there cases where it is not?\n",
    "\n",
    "We apply a filter with a list comprehension to the results below to filter out cases where there is a cardinal number in `sp2`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "non_card = [r for r in quant_rec_adj if F.ls.v(L.u(r[3], 'lex')[0]) != 'card']\n",
    "\n",
    "len(non_card)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(non_card) # uncomment me!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The example below illustrates a complexity here. We iterate through every subphrase in one of the phrases from the result above. We print the subphrase number, the text, the relation, and the number of the subphrase mother..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "subphrase \ttext \trelation \tmother\n",
      "שְׁתֵּֽי־צִפֳּרִ֥ים חַיֹּ֖ות טְהֹרֹ֑ות  1316539 NA ()\n",
      "שְׁתֵּֽי־צִפֳּרִ֥ים  1316535 NA ()\n",
      "שְׁתֵּֽי־ 1316533 NA ()\n",
      "צִפֳּרִ֥ים  1316534 rec (60261,)\n",
      "חַיֹּ֖ות טְהֹרֹ֑ות  1316538 adj (1316535,)\n",
      "חַיֹּ֖ות  1316536 NA ()\n",
      "טְהֹרֹ֑ות  1316537 atr (1316536,)\n",
      "עֵ֣ץ אֶ֔רֶז  1316542 par (1316539,)\n",
      "עֵ֣ץ אֶ֔רֶז  1316543 NA ()\n",
      "עֵ֣ץ  1316540 NA ()\n",
      "אֶ֔רֶז  1316541 rec (60266,)\n",
      "שְׁנִ֥י תֹולַ֖עַת  1316546 par (1316543,)\n",
      "שְׁנִ֥י תֹולַ֖עַת  1316547 NA ()\n",
      "שְׁנִ֥י  1316544 NA ()\n",
      "תֹולַ֖עַת  1316545 rec (60269,)\n",
      "אֵזֹֽב׃  1316548 par (1316547,)\n"
     ]
    }
   ],
   "source": [
    "show_subphrases(686936)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the example of שְׁתֵּֽי־צִפֳּרִ֥ים \"two [of] birds\" there is a `rec` relation between \"two\" and \"birds\". Then there is an adjunct relation, `adj`, further describing the whole subphrase \"two birds\": חַיֹּ֖ות טְהֹרֹ֑ות \"pure beasts\". In this case, it is the `rec` relation which leads to the valid head, while \"pure beasts\" is a secondary description. This example illustrates that there should be a priority for the `rec` relationship."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "subphrase \ttext \trelation \tmother\n",
      "שְׁתֵּי֙ הָעֲבֹתֹ֣ת  1314251 NA ()\n",
      "שְׁתֵּי֙  1314249 NA ()\n",
      "הָעֲבֹתֹ֣ת  1314250 rec (51341,)\n",
      "הַזָּהָ֔ב  1314252 adj (1314251,)\n"
     ]
    }
   ],
   "source": [
    "show_subphrases(682231)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This other example shows the same inner-structure, as do the other 3 that we've manually inspected. This confirms indeed that priority should be given when a noun is found in the `rec` relation. Afterwards, the `adj` relation is checked.\n",
    "\n",
    "\n",
    "Finally, we want to test that there always is a quantified noun in the `adj` relation. Are there other cases, based on the findings above, where the `adj` relation does not actually contain the quantified nouns? We create a looser query than the one above to cover all cases of the `adj` relation. Then we filter the results..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "77"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "no_quants_adj = '''\n",
    "\n",
    "sp1:subphrase\n",
    "    w1:word\n",
    "\n",
    "sp2:subphrase rela=adj\n",
    "    =: word\n",
    "\n",
    "lex ls=card\n",
    "   w2:word\n",
    "   \n",
    "w1 = w2\n",
    "sp1 <mother- sp2\n",
    "'''\n",
    "\n",
    "no_quants_adj = sorted(B.search(no_quants_adj))\n",
    "no_quants_adj = [r for r in no_quants_adj if F.pdp.v(r[3]) not in {'subs', 'nmpr'}]\n",
    "\n",
    "len(no_quants_adj)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(no_quants_adj[:10]) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many of the cases are due to the presence of an article or a determiner. \n",
    "\n",
    "In one case the noun that is present is a demonstrative pronoun (`prde`) אלה for which there is no further specification. For now we exclude determiners and demonstratives and consider their role afterwards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "25"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "no_quants_adj = [r for r in no_quants_adj if F.pdp.v(r[3]) not in {'art', 'prde'}]\n",
    "\n",
    "len(no_quants_adj)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(sorted(no_quants_adj)) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An interesting result occurs in Micah 5:4, where the word in the adjunct position has a pdp of \"adjv\" (adjective) when it should be `subs`. This is a mistake in the data: the participle should be nominalized as a `subs`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "subphrase \ttext \trelation \tmother\n",
      "שִׁבְעָ֣ה רֹעִ֔ים  1379983 NA ()\n",
      "שִׁבְעָ֣ה  1379981 NA ()\n",
      "רֹעִ֔ים  1379982 adj (1379981,)\n",
      "שְׁמֹנָ֖ה נְסִיכֵ֥י אָדָֽם׃  1379988 par (1379983,)\n",
      "שְׁמֹנָ֖ה  1379984 NA ()\n",
      "נְסִיכֵ֥י אָדָֽם׃  1379987 adj (1379984,)\n",
      "נְסִיכֵ֥י  1379985 NA ()\n",
      "אָדָֽם׃  1379986 rec (300679,)\n"
     ]
    }
   ],
   "source": [
    "show_subphrases(829120)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show([(829120,)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This case will be excluded from our algorithm due to the mistake.\n",
    "\n",
    "In other cases, the first word in the `adj` related subphrase is used as an adjective in a construct relation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "subphrase \ttext \trelation \tmother\n",
      "חֲמִשָּׁ֣ה  1342693 NA ()\n",
      "חַלֻּקֵֽי־אֲבָנִ֣ים׀  1342696 adj (1342693,)\n",
      "חַלֻּקֵֽי־ 1342694 NA ()\n",
      "אֲבָנִ֣ים׀  1342695 rec (151691,)\n"
     ]
    }
   ],
   "source": [
    "show_subphrases(738097)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we print the `pdp` values of the words within the related subphrase חלקי־אבנים, we will find the \"subs\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['adjv', 'subs']"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[F.pdp.v(w) for w in L.d(1342696,'word')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are only a couple of these cases in the results. Thus, it will be safe for the algorithm to take the first word with a `pdp` of `subs` that it comes across.\n",
    "\n",
    "Besides these cases, there are cases where there is no `subs`; instead, a participle functions as a substantive but has a pdp of `verb` due to the presence of satellite objects around the verb:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show([(898716,)]) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For these cases, the algorithm should look for cases where there is no other pdp candidate and there is a verb that has a `vt` of participle.\n",
    "\n",
    "There are also several cases where the quantified noun is a personal pronoun, as exemplified below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show([(867246,)]) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These should be added to the set of acceptable quantified heads.\n",
    "\n",
    "Below we remove the cases accounted for thus far."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "no_quants_adj = [r for r in no_quants_adj if F.pdp.v(r[3]) not in {'adjv', 'verb', 'prps'}]\n",
    "\n",
    "len(no_quants_adj)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(no_quants_adj) # uncomment me "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What remains is 5 instances of quantified prepositional phrases. These are cases where the number is truly acting in a nominal capacity. In these cases, the algorithm should not return any quantified nouns since the quantifier itself semantically functions as the noun."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "no_quants_adj = [r for r in no_quants_adj if F.pdp.v(r[3]) not in {'prep'}]\n",
    "\n",
    "len(no_quants_adj)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Those are all the cases in which there is not a traditional \"subs\" within the adjunct of a quantifier. The final set of acceptable `pdp` tags for a quantified noun in an `adj` related subphrase is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [],
   "source": [
    "acceptable_adj_quantifieds = {'subs', # noun\n",
    "                              'nmpr', # proper noun\n",
    "                              'prde', # demonstrative pronoun\n",
    "                              'prps', # personal pronoun\n",
    "                              'verb'} # for participles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**The queries above raise the equivalent question for `rec` related, quantified subphrases:** Are there other kinds of acceptable quantified nouns in the `rec` relationship besides `subs` and `nmpr`? We make a query and test whether we also need a similar set to the one above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "258"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rec_quants = '''\n",
    "\n",
    "sp1:subphrase\n",
    "    w1:word\n",
    "\n",
    "sp2:subphrase rela=rec\n",
    "    =: word\n",
    "\n",
    "lex ls=card\n",
    "   w2:word\n",
    "   \n",
    "w1 = w2\n",
    "w1 <mother- sp2\n",
    "'''\n",
    "\n",
    "rec_quants = sorted(B.search(rec_quants))\n",
    "\n",
    " # apply filters:\n",
    "rec_quants = [r for r in rec_quants \n",
    "                  if F.pdp.v(r[3]) not in {'subs', 'nmpr'} \n",
    "                  and F.ls.v(L.u(r[3], 'lex')[0]) != 'card'\n",
    "             ]\n",
    "\n",
    "len(rec_quants)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(rec_quants) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many of these appear to be cases where the article is in the first position followed by a `subs`. We exclude those below..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rec_quants = [r for r in rec_quants if F.pdp.v(r[3]) != 'art']\n",
    "\n",
    "len(rec_quants)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(rec_quants) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a couple cases where the demonstrative occurs in the quantified position, exemplified below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [],
   "source": [
    "# B.show([(676421,)]) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The rest of the cases seem to be problematic examples of prepositions. They are problematic since they should be coded with a relation of `adj` rather than `rec`. In any case, the set of acceptable `pdp` values should not include the preposition, just as with `adj`. \n",
    "\n",
    "Based on this analysis, the `rec` quantified subphrases can utilize the same check-set as the `adj` quantifieds.\n",
    "\n",
    "#### Conclusion\n",
    "This analysis has found that quantifieds should be processed in the order of `rec` subphrase relations first. If an acceptable part of speech tag is not found within the `rec` subphrase, then the subsequent `adj` subphrase (adjunct) should be checked. In a handful of cases, there will not be a quantified noun since the quantifier itself functions as a nominal element. \n",
    "\n",
    "For both the `rec` and `adj` related subphrases, the same `pdp` check set can be used to isolate viable heads.\n",
    "\n",
    "#### Appendix: Which Subphrase?\n",
    "\n",
    "In cases where the quantified noun is related by the subphrase rela of `adj`, to which subphrase of the quantifier will it relate? It is assumed that it would relate to the largest one..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A good solution would be to progressively move up from the smallest subphrase to the largest subphrase and check for relations on each one until it is found. That is what we follow in the algorithm."
   ]
  },
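  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The conclusion and appendix above can be illustrated with a minimal sketch on toy data. Everything in it is an assumption made for illustration: the node numbers, the `PDP` and `SUBPHRASES` tables, and the name `find_quantified_sketch` are hypothetical stand-ins, not Text-Fabric calls and not the code in `heads.py`. The sketch checks `rec` daughters of the quantifier word first, then climbs the quantifier's subphrases from smallest to largest looking for an `adj` daughter.\n",
    "\n",
    "```python\n",
    "# Hypothetical toy data, not real BHSA nodes.\n",
    "ACCEPTABLE_PDP = {'subs', 'nmpr', 'prde', 'prps', 'verb'}\n",
    "\n",
    "# word node -> pdp (words 1, 5 and 7 play the role of quantifiers)\n",
    "PDP = {1: 'subs', 2: 'art', 3: 'subs', 4: 'subs', 5: 'subs', 6: 'subs', 7: 'subs'}\n",
    "\n",
    "# subphrase node -> (rela, mother, words)\n",
    "SUBPHRASES = {\n",
    "    10: ('NA', None, (1,)),       # quantifier alone\n",
    "    11: ('rec', 1, (2, 3)),       # rec daughter, mother is a word\n",
    "    12: ('NA', None, (1, 2, 3)),  # quantifier plus its rectum\n",
    "    13: ('adj', 12, (4,)),        # adj daughter, mother is a subphrase\n",
    "    20: ('NA', None, (5,)),\n",
    "    21: ('adj', 20, (6,)),\n",
    "    30: ('NA', None, (7,)),       # quantifier with no daughters\n",
    "}\n",
    "\n",
    "def first_acceptable(words):\n",
    "    '''Return the first word whose pdp is in the check-set, else None.'''\n",
    "    return next((w for w in words if PDP[w] in ACCEPTABLE_PDP), None)\n",
    "\n",
    "def find_quantified_sketch(quantifier):\n",
    "    '''Return the quantified word for a quantifier word, or None.'''\n",
    "    # 1) rec daughters of the quantifier word take priority\n",
    "    for rela, mother, words in SUBPHRASES.values():\n",
    "        if rela == 'rec' and mother == quantifier:\n",
    "            hit = first_acceptable(words)\n",
    "            if hit is not None:\n",
    "                return hit\n",
    "    # 2) climb the subphrases containing the quantifier, smallest first,\n",
    "    #    and look for an adj daughter of each one\n",
    "    containing = sorted(\n",
    "        (sp for sp, (_, _, words) in SUBPHRASES.items() if quantifier in words),\n",
    "        key=lambda sp: len(SUBPHRASES[sp][2]),\n",
    "    )\n",
    "    for sp in containing:\n",
    "        for rela, mother, words in SUBPHRASES.values():\n",
    "            if rela == 'adj' and mother == sp:\n",
    "                hit = first_acceptable(words)\n",
    "                if hit is not None:\n",
    "                    return hit\n",
    "    # 3) nothing found: the quantifier itself acts as the nominal element\n",
    "    return None\n",
    "```\n",
    "\n",
    "With this toy data, `find_quantified_sketch(1)` returns word 3 through the `rec` relation even though an `adj` daughter also exists, `find_quantified_sketch(5)` falls back to the `adj` daughter and returns word 6, and `find_quantified_sketch(7)` returns `None` because the quantifier stands alone."
   ]
  },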
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Adjective -> Subs Missed Results\n",
    "\n",
    "In the first test, several substantives are missed due to the presence of an adjectival element. Let's look at those cases and see what's going on. I have copied the phrase numbers of a few relevant examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [],
   "source": [
    "adj_examples = [(771933,), (799523,)]\n",
    "\n",
    "#B.show(adj_examples) # uncomment me"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "subphrase \ttext \trelation \tmother\n",
      "בֶן־ 1355711 NA ()\n",
      "אָמֹ֔וץ  1355712 rec (207817,)\n"
     ]
    }
   ],
   "source": [
    "show_subphrases(adj_examples[0][0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "יְשַֽׁעְיָ֣הוּ  nmpr\n",
      "בֶן־ subs\n",
      "אָמֹ֔וץ  nmpr\n"
     ]
    }
   ],
   "source": [
    "for word in L.d(adj_examples[0][0], 'word'):\n",
    "    print(T.text([word]), F.pdp.v(word))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, the substantive is not detected by the algorithm since it is in a dependent subphrase, a construct relation, with its modifying adjective. How to extract these nouns?\n",
    "\n",
    "This is very similar to the quantifier case, where the word in the rectum is actually the head (e.g. שתי שנה \"two years\" where \"two\" is registered as the head, but the substantive \"years\" is the semantic head). This kind of relationship is differentiated from non-heads by the fact that the adjective itself is independent. Thus, in cases where the adjective is independent and has a daughter rectum subphrase, the algorithm should retrieve the attributed noun.\n",
    "\n",
    "**proposed solution**: Add `adjv` to the set of acceptable `pdp` for the `NP`. Any adjectives will be processed for dependency: most will fail that test. But for the dozens of cases where the adjective does not fail, the algorithm will apply a separate check for a `rec` related subphrase which contains the true head.\n",
    "\n",
    "### Participle -> Head Missed Results\n",
    "\n",
    "Other phrases that end up headless are noun phrases in which a participle serves as the nominal element but, since it has satellites, is coded as a \"verb\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [],
   "source": [
    "verb_examples = [(709010,), (711593,), (756104,)]\n",
    "\n",
    "#B.show(verb_examples) # uncomment me"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "subphrase \ttext \trelation \tmother\n",
      "\n",
      "subphrase \ttext \trelation \tmother\n",
      "\n",
      "subphrase \ttext \trelation \tmother\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for phrase in verb_examples:\n",
    "    show_subphrases(phrase[0])\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are mixed cases here due to the shortcomings of the current data model. In these cases, the participle is marked as a \"verb\" since it also has objects or descriptors. In the first example above, the noun גרה functions as the *object* of the verb. The head is מעלה. But the same logic does not hold for the second or third case. In the second case, פצוע־דכא gives an *attribute* or quality of שפכה. In the third case, מצק \"poured\" describes an attribute of נחשת \"bronze.\" Thus the opposite of example 1 is true: the head noun is the attributed noun in the construct relation.\n",
    "\n",
    "Since the specific role of the noun or the verb is not specified at this lower phrase level, is there even a way to differentiate these cases?\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "709010\n",
      "וְ conj\n",
      "\n",
      "711593\n",
      "אֵ֥ין  nega\n",
      "\n",
      "756104\n",
      "בֵיתֹו֩  subs\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for phrase in verb_examples:\n",
    "    print(phrase[0])\n",
    "    for word in L.d(phrase[0], 'word'):\n",
    "        print(T.text([word]), F.pdp.v(word))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It actually appears that the database treats all 3 the same: as adjectives at the phrase-dependent part of speech level. Thus, these cases will receive the same treatments as the adjective cases above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### KL/ relation problems\n",
    "\n",
    "I found an instance in Numbers 3:15 where the subphrase relationship that connects כל with its quantified noun is \"atr.\" That is probably wrong. Are there other cases with the same problem? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kl_prob = '''\n",
    "\n",
    "sp1:subphrase\n",
    "    w1:word lex=KL/ st=a\n",
    "\n",
    "sp2:subphrase rela=atr\n",
    "\n",
    "sp2 -mother> sp1\n",
    "sp2 >> sp1\n",
    "w1 :: sp1\n",
    "'''\n",
    "\n",
    "kl_prob = sorted(B.search(kl_prob))\n",
    "\n",
    "len(kl_prob)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(kl_prob) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems that the adjectives are not nominalized in this construction as pdp of `subs`. Most of the findings are adjectives in construct with כל. But there are several cases of the participle also.\n",
    "\n",
    "Is this encoding correct?\n",
    "\n",
    "If the `rela` code were properly `rec` as most are, then this would simply be a matter of adding an additional acceptable `pdp` to the list within the get_quantified function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#kl_prob = [r for r in kl_prob if not {'adjv'} & set(F.pdp.v(w) for w in L.d(r[2], 'word'))]\n",
    "\n",
    "len(kl_prob)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kl_prob = set(r[0] for r in kl_prob)\n",
    "\n",
    "len(kl_prob)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Subphrase by Subphrase Approach?\n",
    "\n",
    "This section experiments with switching from a word-by-word approach to a subphrase-by-subphrase approach. The first iteration of the `get_heads` function iterated word by word to identify valid heads with independent subphrase relations. A more efficient and methodologically sound approach would be to work from the subphrase down to the word, which is what is tried here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 193,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_phrases = [ph for ph in F.typ.s('NP') \n",
    "                    if len(L.d(ph, 'word')) == 5 \n",
    "                    and F.otype.v(ph) == 'phrase']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 194,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "655731"
      ]
     },
     "execution_count": 194,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test = test_phrases[20]\n",
    "\n",
    "test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 195,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "subphrase \ttext \trelation \tmother\n",
      "יְלִ֥יד בֵּֽיתְךָ֖  1302879 NA ()\n",
      "יְלִ֥יד  1302877 NA ()\n",
      "בֵּֽיתְךָ֖  1302878 rec (7530,)\n",
      "מִקְנַ֣ת כַּסְפֶּ֑ךָ  1302882 par (1302879,)\n",
      "מִקְנַ֣ת  1302880 NA ()\n",
      "כַּסְפֶּ֑ךָ  1302881 rec (7533,)\n"
     ]
    }
   ],
   "source": [
    "show_subphrases(test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 196,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1302879, 1302877, 1302880]"
      ]
     },
     "execution_count": 196,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "head_cands = [sp for sp in L.d(test, 'subphrase') if F.rela.v(sp) == 'NA']\n",
    "\n",
    "head_cands"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note above that the heads are those within NA relations that consist of single words. How consistent is this? Are there any cases where the head does not receive its own individual subphrase with a NA relation? Or are there cases of NA relations of non-head elements? Below we run a couple of tests, and then we build a primitive head finder based on this hypothesis in order to manually inspect what happens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 201,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "example found: \n",
      "23 (1300568,) {'rec'}\n"
     ]
    }
   ],
   "source": [
    "for word in F.otype.s('word'):\n",
    "    \n",
    "    subphrases = L.u(word, 'subphrase')\n",
    "    \n",
    "    if not subphrases:\n",
    "        continue\n",
    "        \n",
    "    sp_relas = set(F.rela.v(sp) for sp in subphrases)\n",
    "    \n",
    "    if not {'NA', 'par'} & sp_relas:\n",
    "        print('example found: ')\n",
    "        print(word, subphrases, sp_relas)\n",
    "        break\n",
    "        \n",
    "else:\n",
    "    # runs only when the loop finishes without hitting `break`\n",
    "    print('search complete with 0 results')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 202,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'תְהֹ֑ום '"
      ]
     },
     "execution_count": 202,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "T.text(L.d(1300568, 'word'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 224,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "45161"
      ]
     },
     "execution_count": 224,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "no_na = '''\n",
    "\n",
    "sp1:subphrase\n",
    "    w1:word\n",
    "sp2:subphrase\n",
    "\n",
    "sp2 -mother> w1\n",
    "'''\n",
    "\n",
    "no_na = sorted(S.search(no_na))\n",
    "\n",
    "len(no_na)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 229,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "words with construct relation and no NA subphrase: 0\n"
     ]
    }
   ],
   "source": [
    "no_na_filtered = []\n",
    "\n",
    "for r in no_na:\n",
    "    \n",
    "    reg = r[1]\n",
    "    \n",
    "    reg_subphrases = L.u(reg, 'subphrase')\n",
    "    reg_sp_relas = set(F.rela.v(sp) for sp in reg_subphrases)\n",
    "    \n",
    "    if 'NA' not in reg_sp_relas:\n",
    "        no_na_filtered.append(r)\n",
    "        \n",
    "print(f'words with construct relation and no NA subphrase: {len(no_na_filtered)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The search above shows that whenever a word is in a construct relation with a subphrase, an NA (no relation) subphrase exists. \n",
    "\n",
    "Let's broaden the inquiry a bit. What are the specific situations in which a phrase has no NA subphrase at all? What kinds of relations are present? What kinds of phrases are they?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 236,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Counter({'NO subphrases': 215258, 'has NA': 37952})\n"
     ]
    }
   ],
   "source": [
    "na_survey = collections.Counter()\n",
    "\n",
    "for phrase in F.otype.s('phrase'):\n",
    "    \n",
    "    subphrase_relas = tuple(sorted(set(F.rela.v(sp) for sp in L.d(phrase, 'subphrase'))))\n",
    "    \n",
    "    if not subphrase_relas:\n",
    "        na_survey['NO subphrases'] += 1\n",
    "        \n",
    "    elif 'NA' in subphrase_relas:\n",
    "        na_survey['has NA'] += 1\n",
    "        \n",
    "    else:\n",
    "        na_survey[subphrase_relas] += 1\n",
    "    \n",
    "pprint(na_survey)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This count shows that there are only two situations in the data: either \n",
    "\n",
    "1) a phrase has no subphrases present, or \n",
    "\n",
    "2) it has a subphrase with a relation of \"NA\". There are NO cases of phrases that lack an NA subphrase but have other relations. That is good for our hypothesis...\n",
    "\n",
    "In the experiment below, two important assumptions are made about the head:\n",
    "\n",
    "**First**, it is assumed that **the head is the first valid `pdp` word in the phrase**, with the exception of quantifieds and attributed nouns which are handled differently. \n",
    "\n",
    "**Second**, it is assumed that the **first NA-relation subphrase contains the head**. We test that assumption by manually inspecting the output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 292,
   "metadata": {},
   "outputs": [],
   "source": [
    "def primitive_head_hunter(phrase):\n",
    "    \n",
    "    '''\n",
    "    Looks at noun phrases for heads.\n",
    "    Returns a 1-tuple with the head word node, or None if no head is found.\n",
    "    '''\n",
    "    \n",
    "    good_pdp = {'subs', 'nmpr'}\n",
    "    \n",
    "    subphrase_candidates = [sp for sp in L.d(phrase, 'subphrase') \n",
    "                                if F.rela.v(sp) == 'NA'\n",
    "                                and F.rela.v(L.u(sp, 'phrase_atom')[0]) == 'NA'\n",
    "                           ]\n",
    "    \n",
    "    # handle simple phrases (no independent NA subphrase)\n",
    "    if not subphrase_candidates:\n",
    "        head_candidates = [w for w in L.d(phrase, 'word') if F.pdp.v(w) in good_pdp]\n",
    "        if head_candidates:\n",
    "            return (head_candidates[0],)\n",
    "        print(f'exception at {phrase}')\n",
    "        return None # do not fall through to the indexing below\n",
    "    \n",
    "    # attempt simple head assignment\n",
    "    first_na_subphrase = subphrase_candidates[0]\n",
    "    the_head = next((w for w in L.d(first_na_subphrase, 'word') if F.pdp.v(w) in good_pdp), None)\n",
    "    if the_head is not None:\n",
    "        return (the_head,)\n",
    "    \n",
    "    # adjective-initial subphrases are knowingly skipped for now\n",
    "    if F.pdp.v(L.d(first_na_subphrase, 'word')[0]) == 'adjv':\n",
    "        return None\n",
    "    raise Exception(phrase)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 296,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_results = [primitive_head_hunter(ph) for ph in test_phrases]\n",
    "\n",
    "random.shuffle(test_results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 298,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(test_results) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As it turns out, the assumption about the NA subphrase relation is workable. But the complications of this approach (explained below) make it an unlikely solution for now."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Conclusion\n",
    "\n",
    "I've done some initial testing with the subphrase by subphrase approach. It is a promising method, but requires a more complicated implementation with nested searches through each level of the phrase hierarchy. A simple subphrase by subphrase approach is not sufficient—one needs to go phrase by phrase, phrase_atom by phrase_atom, subphrase by subphrase, and even beyond. It is a recursive problem that cannot be navigated with the present, limited data model. There is more to say about the present state of the data model which I will save for the final report.\n",
    "\n",
    "At present, the word-by-word approach provides an elegant (though limited) solution that is able to navigate the quirks of the present data model and provide an acceptable level of accuracy, with some exceptions for more complicated phrase constructions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Handling Parallels\n",
    "\n",
    "What is the best way to handle parallel head elements? In general, a phrase has only one real \"head\". That is, often the first head element determines the grammatical gender or number of the verb (thanks to Constantijn Sikkel for this conversation). Yet, the nouns which are coordinate to the head are often of interest for both grammatical and semantic studies.\n",
    "\n",
    "There are two approaches to collecting coordinate heads. One is to check for every word with a relation of \"parallel\" whether its mother is already established as a head. Another approach is to recursively search for nouns that are coordinate with the head word. Up until this inquiry, I have opted for option 1 due to the complexity of checking necessary relationships for a head candidate. But a phrase in Deuteronomy 12:17 is missed by this approach, since it contains a chain of head nouns in construct with the quantifier מעשר \"tenth\". \n",
    "\n",
    "It is possible to edit the algorithm to accommodate these cases. But the example raises the broader question of whether option 1 is truly sufficient and methodologically sound. In this section, I test whether option 2 is a better alternative. First, we are unsure about how to separate a head word from a larger, paralleled subphrase. In option 1, individual words are tested, each for dependent relationships. But option 2 will go the opposite direction: beginning at the subphrase level and working down to the word. Does this affect our ability to separate the head noun of the phrase?"
   ]
  },
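  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For orientation, option 2 can be sketched as a recursive walk over `par` daughters. This is only a sketch on toy data: the node numbers, the `PAR_DAUGHTERS` and `HEAD_OF` tables, and the name `coordinate_heads` are hypothetical and stand in for lookups that would go through Text-Fabric in the real data.\n",
    "\n",
    "```python\n",
    "# Hypothetical toy data: subphrase -> subphrases related to it by 'par'\n",
    "PAR_DAUGHTERS = {100: [101], 101: [102], 102: []}\n",
    "\n",
    "# Hypothetical toy data: subphrase -> its already-identified head word\n",
    "HEAD_OF = {100: 1, 101: 4, 102: 7}\n",
    "\n",
    "def coordinate_heads(subphrase, seen=None):\n",
    "    '''Collect the head of a subphrase plus the heads of everything coordinated with it.'''\n",
    "    seen = set() if seen is None else seen\n",
    "    if subphrase in seen:  # guard against cycles in malformed data\n",
    "        return []\n",
    "    seen.add(subphrase)\n",
    "    heads = [HEAD_OF[subphrase]]\n",
    "    for daughter in PAR_DAUGHTERS.get(subphrase, []):\n",
    "        heads.extend(coordinate_heads(daughter, seen))\n",
    "    return heads\n",
    "```\n",
    "\n",
    "Calling `coordinate_heads(100)` on this toy chain follows 100 to 101 to 102 and returns the three head words in order. Option 1 instead visits each word and asks whether its mother is already a head."
   ]
  },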
  {
   "cell_type": "code",
   "execution_count": 164,
   "metadata": {},
   "outputs": [],
   "source": [
    "def OLD_get_heads(phrase):\n",
    "    '''\n",
    "    Extracts and returns the heads of a supplied\n",
    "    phrase or phrase atom based on that phrase's type\n",
    "    and the relations reflected within the phrase.\n",
    "    \n",
    "    --input--\n",
    "    phrase(atom) node number\n",
    "    \n",
    "    --output--\n",
    "    tuple of head word node(s) \n",
    "    '''\n",
    "    \n",
    "    # mapping from phrase type to good part of speech values for heads\n",
    "    head_pdps = {'VP': {'verb'},                   # verb \n",
    "                 'NP': {'subs', 'adjv', 'nmpr'},   # noun \n",
    "                 'PrNP': {'nmpr', 'subs'},         # proper-noun \n",
    "                 'AdvP': {'advb', 'nmpr', 'subs'}, # adverbial \n",
    "                 'PP': {'prep'},                   # prepositional \n",
    "                 'CP': {'conj', 'prep'},           # conjunctive\n",
    "                 'PPrP': {'prps'},                 # personal pronoun\n",
    "                 'DPrP': {'prde'},                 # demonstrative pronoun\n",
    "                 'IPrP': {'prin'},                 # interrogative pronoun\n",
    "                 'InjP': {'intj'},                 # interjectional\n",
    "                 'NegP': {'nega'},                 # negative\n",
    "                 'InrP': {'inrg'},                 # interrogative\n",
    "                 'AdjP': {'adjv'}                  # adjective\n",
    "                } \n",
    "    \n",
    "    # get phrase-head's part of speech value and list of candidate matches\n",
    "    phrase_type = F.typ.v(phrase)\n",
    "    head_candidates = [w for w in L.d(phrase, 'word')\n",
    "                          if F.pdp.v(w) in head_pdps[phrase_type]]\n",
    "        \n",
    "    # VP with verbs require no further processing, return the head verb\n",
    "    if phrase_type == 'VP':        \n",
    "        return tuple(head_candidates)\n",
    "        \n",
    "    # go head-hunting!\n",
    "    heads = []\n",
    "    \n",
    "    for word in head_candidates:\n",
    "        \n",
    "        # gather the word's subphrase (+ phrase_atom if otype is phrase) relations\n",
    "        word_phrases = list(L.u(word, 'subphrase'))\n",
    "        word_phrases += list(L.u(word, 'phrase_atom')) if (F.otype.v(phrase) == 'phrase') else list()\n",
    "        word_relas = set(F.rela.v(phr) for phr in word_phrases) or {'NA'}\n",
    "\n",
    "        # check (sub)phrase relations for independency\n",
    "        if word_relas - {'NA', 'par', 'Para'}: \n",
    "            continue\n",
    "            \n",
    "        # check parallel relations for independency\n",
    "        elif word_relas & {'par', 'Para'} and mother_is_head(word_phrases, heads):\n",
    "            this_head = find_quantified(word) or find_attributed(word) or word\n",
    "            heads.append(this_head)\n",
    "            \n",
    "        # save all others as heads, check for quantifiers first\n",
    "        elif word_relas == {'NA'}:\n",
    "            this_head = find_quantified(word) or find_attributed(word) or word\n",
    "            heads.append(this_head)\n",
    "            \n",
    "    return tuple(sorted(set(heads)))\n",
    "\n",
    "def mother_is_head(word_phrases, previous_heads):\n",
    "    \n",
    "    '''\n",
    "    Test and validate parallel relationships for independency.\n",
    "    Must gather the mother for each relation and check whether \n",
    "    the mother contains a head word. \n",
    "    \n",
    "    --input--\n",
    "    * list of phrase nodes for a given word (includes subphrases)\n",
    "    * list of previously approved heads\n",
    "    \n",
    "    --output--\n",
    "    boolean\n",
    "    '''\n",
    "    \n",
    "    # get word's enclosing phrases that are parallel\n",
    "    parallel_phrases = [ph for ph in word_phrases if F.rela.v(ph) in {'par', 'Para'}]\n",
    "    # get the mother for the parallel phrases\n",
    "    parallel_mothers = [E.mother.f(ph)[0] for ph in parallel_phrases] \n",
    "    # get mothers' words, by mother\n",
    "    parallel_mom_words = [set(L.d(mom, 'word')) for mom in parallel_mothers]\n",
    "    # test for head in each mother\n",
    "    test_mothers = [bool(phrs_words & set(previous_heads)) for phrs_words in parallel_mom_words] \n",
    "        \n",
    "    return all(test_mothers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How many subphrases with a parallel relation to a validated head consist of more than one word?\n",
    "\n",
    "We take the first head element for every noun phrase and check its parallel elements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 163,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "length: 1\n",
      "\t 3346\n",
      "length: 2\n",
      "\t 686\n",
      "length: 3\n",
      "\t 86\n",
      "length: 4\n",
      "\t 24\n",
      "length: 9\n",
      "\t 10\n",
      "length: 5\n",
      "\t 4\n",
      "length: 6\n",
      "\t 3\n"
     ]
    }
   ],
   "source": [
    "par_word_count = collections.Counter()\n",
    "par_word_list = collections.defaultdict(list)\n",
    "\n",
    "for np in F.typ.s('NP'):\n",
    "    \n",
    "    heads = OLD_get_heads(np)\n",
    "    \n",
    "    if not heads:\n",
    "        continue\n",
    "    \n",
    "    the_head = heads[0]\n",
    "        \n",
    "    if not L.u(the_head, 'subphrase'):\n",
    "        continue\n",
    "        \n",
    "    head_smallest_sp = sorted(sp for sp in L.u(the_head, 'subphrase'))[0]\n",
    "    \n",
    "    par_daughter = [d for d in E.mother.t(head_smallest_sp) if F.rela.v(d) == 'par']\n",
    "    \n",
    "    for pd in par_daughter:\n",
    "        \n",
    "        word_length = len(L.d(par_daughter[0], 'word'))\n",
    "        \n",
    "        par_word_count[word_length] += 1\n",
    "        par_word_list[word_length].append((the_head, head_smallest_sp))\n",
    "        \n",
    "for w_count, count in par_word_count.items():\n",
    "    print('length:', w_count)\n",
    "    print('\\t', count)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see some of the larger cases..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {},
   "outputs": [],
   "source": [
    "# B.show(par_word_list[6]) # uncomment me"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1409908 par אֹצְרֹ֥ות מַאֲכָ֖ל וְשֶׁ֥מֶן וָיָֽיִן׃ \n"
     ]
    }
   ],
   "source": [
    "ex_subphrase = par_word_list[6][1][1]\n",
    "\n",
    "for daughter in [d for d in E.mother.t(ex_subphrase) if F.rela.v(d) == 'par']:\n",
    "    \n",
    "    print(daughter, F.rela.v(daughter), T.text(L.d(daughter, 'word')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now some examples of 2 word lengths..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(par_word_list[2][:5]) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These examples raise an important possibility. If we take the first word labeled \"subs\" (substantive) within the parallel, will that give us the coordinate head?\n",
    "\n",
    "Below is an example from Genesis 5:3 which shows a potential pitfall of the method 2 approach, and even of the current approach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 238,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "652845 שְׁלֹשִׁ֤ים וּמְאַת֙ שָׁנָ֔ה \n",
      "subphrase \ttext \trelation \tmother\n",
      "שְׁלֹשִׁ֤ים  1301065 NA ()\n",
      "מְאַת֙ שָׁנָ֔ה  1301068 par (1301065,)\n",
      "מְאַת֙  1301066 NA ()\n",
      "שָׁנָ֔ה  1301067 rec (2154,)\n"
     ]
    }
   ],
   "source": [
    "example = L.d(T.nodeFromSection(('Genesis', 5, 3)), 'phrase')[3]\n",
    "\n",
    "print(example, T.text(L.d(example,'word')))\n",
    "show_subphrases(example)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The example above illustrates the shortcoming of even the current method of separating quantifiers and quantifieds, as seen in this result:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'שְׁלֹשִׁ֤ים |שָׁנָ֔ה '"
      ]
     },
     "execution_count": 144,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'|'.join(T.text([h]) for h in get_heads(example))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The algorithm retrieves both \"thirty\" and \"year,\" even though the only head in this case is \"year\". This is a shortcoming of the quantifier function, which in this case has not detected a complex quantifier that is formed with a parallel relation.\n",
    "\n",
    "The quantifier algorithm should have passed שלשים along to another test before validating it as a head. That is, it should look for this case of a complex quantifier. This is actually another good reason to change the parallels finder to approach 2, so that parallels are processed at the head level rather than disconnected from it. In this setup, the algorithm will gather all parallels to the head. If the syntactic head is a quantifier. If it has no quantified noun, then the algorithm will look further at the parallel relationship to see if it is also a quantifier. If it is, then it will look to find that quantifier's substantive and return it instead. This is a complex recursive process that will have to be coded."
   ]
  },
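  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The recursive lookup described above can be sketched on toy data before it is written against the real features. This is only a rough illustration under stated assumptions: the dictionaries `is_quantifier`, `quantified`, and `parallel` and the function `resolve_head` are hypothetical stand-ins, not part of `heads.py` or Text-Fabric, and the English glosses replace the word nodes of the Genesis 5:3 phrase.\n",
    "\n",
    "```python\n",
    "# Toy stand-ins for the Genesis 5:3 phrase; every name here is hypothetical.\n",
    "is_quantifier = {'thirty': True, 'hundred': True, 'year': False}\n",
    "\n",
    "# quantified[w]: the noun that quantifier w governs directly\n",
    "quantified = {'hundred': 'year'}\n",
    "\n",
    "# parallel[w]: the first word of the subphrase coordinated with w\n",
    "parallel = {'thirty': 'hundred'}\n",
    "\n",
    "\n",
    "def resolve_head(word, seen=None):\n",
    "    '''Follow quantifier and parallel links until a non-quantifier is reached.'''\n",
    "    seen = set() if seen is None else seen\n",
    "    if word in seen:\n",
    "        return word  # cyclic toy data: stop instead of recursing forever\n",
    "    seen.add(word)\n",
    "\n",
    "    if not is_quantifier.get(word, False):\n",
    "        return word  # an ordinary word is its own head\n",
    "    if word in quantified:\n",
    "        return resolve_head(quantified[word], seen)\n",
    "    if word in parallel:\n",
    "        return resolve_head(parallel[word], seen)\n",
    "    return word  # bare quantifier with nothing to quantify\n",
    "\n",
    "\n",
    "# the two candidates that the current algorithm returns for this phrase\n",
    "heads = {resolve_head(w) for w in ('thirty', 'year')}\n",
    "print(heads)\n",
    "```\n",
    "\n",
    "On this toy data both candidates resolve to the same word, so the set of heads contains only \"year\". A real version would presumably read these links from the subphrase relations rather than from hand-built dictionaries."
   ]
  },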
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {},
   "outputs": [],
   "source": [
    "#B.show(par_word_list[3][:15]) # uncomment me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Are there cases where there is multiple coordinate relations with a single subphrase?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 160,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_multiple_coor = '''\n",
    "\n",
    "sp1:subphrase\n",
    "sp2:subphrase rela=par\n",
    "sp3:subphrase rela=par\n",
    "\n",
    "sp2 -mother> sp1\n",
    "sp3 -mother> sp1\n",
    "\n",
    "sp2 # sp3\n",
    "'''\n",
    "\n",
    "test_multiple_coor = sorted(S.search(test_multiple_coor))\n",
    "\n",
    "len(test_multiple_coor)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No, it does not happen. Thus, coordinate relations are chained to each other, not multiplied to a single mother."
   ]
  },
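  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because no mother has more than one `par` daughter, a coordination can be walked as a simple chain. Here is a minimal sketch on toy data, assuming hypothetical subphrase labels `A` to `D` and a hypothetical `mother_of` mapping in place of the `mother` edge feature; `coordinate_chain` is likewise an illustrative name.\n",
    "\n",
    "```python\n",
    "# Toy data: each par subphrase points to the subphrase it is coordinated with.\n",
    "mother_of = {'B': 'A', 'C': 'B', 'D': 'C'}\n",
    "\n",
    "# Inverting the mapping is safe only because each mother has at most one par daughter.\n",
    "daughter_of = {mother: daughter for daughter, mother in mother_of.items()}\n",
    "\n",
    "\n",
    "def coordinate_chain(start):\n",
    "    '''Collect a subphrase and every subphrase coordinated after it.'''\n",
    "    chain = [start]\n",
    "    # stop at the end of the chain, or if the toy data would loop back on itself\n",
    "    while chain[-1] in daughter_of and daughter_of[chain[-1]] not in chain:\n",
    "        chain.append(daughter_of[chain[-1]])\n",
    "    return chain\n",
    "\n",
    "\n",
    "print(coordinate_chain('A'))\n",
    "```\n",
    "\n",
    "If a mother could have several `par` daughters, the inverted dictionary would silently drop all but one of them, so this shortcut depends on the search result above."
   ]
  },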
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Conclusions\n",
    "This inquiry sparked the one above it about the subphrase by subphrase approach. We have decided to table this method for now."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}