{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img align=\"right\" src=\"images/dans-small.png\"/>\n",
    "<img align=\"right\" src=\"images/tf-small.png\"/>\n",
    "<img align=\"right\" src=\"images/etcbc.png\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Trees - for BHSA data (Hebrew)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## About\n",
    "\n",
    "This notebook composes syntax trees out of the\n",
    "[BHSA](https://etcbc.github.io/bhsa/) dataset of the Hebrew Bible, its text and it linguistic annotations.\n",
    "\n",
    "The source data is the\n",
    "[text-fabric](https://github.com/Dans-labs/text-fabric/wiki) representation of this dataset.\n",
    "\n",
    "The result is a set of roughly 65,000 tree structures, one for each sentence, in\n",
    "[Penn Treebank notation](https://en.wikipedia.org/wiki/Treebank), like this\n",
    "\n",
    "```\n",
    "(S (NP (NNP John))\n",
    "   (VP (VPZ loves)\n",
    "       (NP (NNP Mary)))\n",
    "   (. .))\n",
    "```\n",
    "\n",
    "but not necessarily adhering to any known tag set.\n",
    "Note that the tags generated by this notebook can be can be customized.\n",
    "\n",
    "Trees in this format can be searched by e.g.\n",
    "[tgrep](https://web.stanford.edu/dept/linguistics/corpora/cas-tut-tgrep.html).\n",
    "\n",
    "We create a new feature, `tree`, that holds these the tree structure for each sentence.\n",
    "In this way the trees are readily available when you need them in your Text-Fabric workflow.\n",
    "\n",
    "See the [example](example.ipynb) notebook.\n",
    "\n",
    "This notebook can also be used to generate text files with all trees in it.\n",
    "For that you have to run the notebook on your own computer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## BHSA data and syntax trees"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The process of tree construction is not straightforward,\n",
    "since the BHSA data have not been coded as syntax trees.\n",
    "Rather they take the shape of a collection of features that describe\n",
    "observable characteristics of the words, phrases, clauses and sentences.\n",
    "Moreover, if a phrase, clause or sentence is discontinuous,\n",
    "it is divided in `phrase_atoms`, `clause_atoms`,\n",
    "or `sentence_atoms`, respectively, which are by definition continuous.\n",
    "\n",
    "There are no explicit hierarchical relationships between these objects, or rather *nodes*.\n",
    "But there is an implicit hierarchy: *embedding*.\n",
    "Every *node* is linked to the set of word nodes, these are the words that are \"contained\" in\n",
    "the node.\n",
    "This induces a containment relationship on the set of nodes.\n",
    "\n",
    "This notebook makes use of a Python module `tree.py` (in the same directory).\n",
    "This module works on top of Text-Fabric and knows the general structure of an ancient text.\n",
    "It constructs a hierarchy of words, subphrases, phrases, clauses and sentences\n",
    "based on the embedding relationship.\n",
    "\n",
    "But this is not all.\n",
    "The BHSA data contains a *mother* relationship,\n",
    "which denotes linguistic dependency.\n",
    "The module `trees.py` reconstructs the tree obtained from the embedding relationship\n",
    "by using the mother relationship as a set of instructions to move certain nodes below others.\n",
    "In some cases extra nodes will be constructed as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The embedding relationship"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Nodes:\n",
    "The BHSA data is coded in such a way that every node is associated with a *type* and a *slot set*.\n",
    "\n",
    "The *type* of a node, $T(O)$, determines which features a node has.\n",
    "BHSA types are `sentence`, `sentence_atom`,\n",
    "`clause`, `clause_atom`, `phrase`, `phrase_atom`, `subphrase`, `word`, and there are also\n",
    "the non-linguistic types `book`, `chapter`, `verse` and `half_verse`.\n",
    "\n",
    "There is an implicit *ordering of node types*, given by the sequence above, where `word` comes first and\n",
    "`sentence` comes last. We denote this ordering by $<$.\n",
    "\n",
    "The *slot set* of a node, $m(O)$, is the set of word occurrences linked to that node.\n",
    "Every word occurrence in the source occupies a unique slot, which is a number, so slot sets are sets of numbers.\n",
    "Think of the slots as the textual positions of individual words throughout the whole text.\n",
    "\n",
    "Note that when a sentence contains a clause which contains a phrase,\n",
    "the sentence, clause, and phrase are linked to slot sets that contain each other.\n",
    "The fact that a sentence \"contains\" a clause is not marked directly,\n",
    "it is a consequence of how the slot sets they are linked to are embedded.\n",
    "\n",
    "### Definition (slot set order):\n",
    "There is a\n",
    "[natural order](https://github.com/Dans-labs/text-fabric/wiki/Api#sorting-nodes)\n",
    "on slot sets, which we will use.\n",
    "\n",
    "We will not base our trees on *all* node types,\n",
    "since in the BHSA data they do not constitute a single hierarchy.\n",
    "We will restrict ourselves to the set $\\cal O = \\{$ ``sentence``, ``clause``, ``phrase``, ``word`` $\\}$.\n",
    "\n",
    "### Definition (directly below):\n",
    "Node type $T_1$\n",
    "is *directly below*\n",
    "$T_2$ ( $T_1 <_1 T_2 $ ) in $\\cal O$\n",
    "if $T_1 < T_2$\n",
    "and there is no $T$ in $\\cal O$ with\n",
    "$T_1 < T < T_2$.\n",
    "\n",
    "Now we can introduce the notion of (tree) parent with respect to a set of node types $\\cal O$\n",
    "(e.g. ):\n",
    "\n",
    "### Definition (parent)\n",
    "Node $A$ is a parent of node $B$ if the following are true:\n",
    "1. $m(A) \\subseteq\\ m(B)$\n",
    "2. $T(A) <_1 T(B)$ in $\\cal O$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The mother relationship"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While using the embedding got us trees,\n",
    "using the mother relationship will give us more interesting trees.\n",
    "In general, the *mother* in the BHSA dataset points to a node\n",
    "on which the node in question is, in some sense, dependent.\n",
    "The nature of this dependency is coded in a specific feature on clauses,\n",
    "the `clause_constituent_relation` in version 3,\n",
    "or [rela](https://etcbc.github.io/bhsa/features/hebrew/2017/rela.html) in later versions.\n",
    "\n",
    "Later in this notebook we'll show a frequency list of the values.\n",
    "\n",
    "Here is a description of what we do with the mother relationship.\n",
    "\n",
    "If a *clause* has a mother, there are three cases for the `rela` value of this clause:\n",
    "\n",
    "1. its value is in $\\{$ ``Adju``, ``Objc``, ``Subj``, ``PrAd``, ``PreC``, ``Cmpl``, ``Attr``, ``RgRc``, ``Spec`` $\\}$\n",
    "2. its value is ``Coor``\n",
    "3. its value is in $\\{$ ``Resu``, ``ReVo``, ``none`` $\\}$\n",
    "\n",
    "In case 3 we do nothing.\n",
    "\n",
    "In case 1 we remove the link of the clause to its parent\n",
    "and add the clause as a child to either the node\n",
    "that the mother points to, or to the parent of the mother.\n",
    "We do the latter only if the mother is a word.\n",
    "We will not add children to words.\n",
    "\n",
    "In the diagrams, the red arrows represent the mother relationship,\n",
    "and the black arrows the embedding relationships,\n",
    "and the fat black arrows the new parent relationships.\n",
    "The gray arrows indicated severed parent links.\n",
    "\n",
    "<img src=\"images/TreesCase1.png\"/>\n",
    "\n",
    "In case 2 we create a node between the mother and its parent.\n",
    "This node takes the name of the mother, and the mother will be added as child,\n",
    "but with name ``Ccoor``, and the clause which points to the mother is added as a sister.\n",
    "\n",
    "This is a rather complicated case, but the intuition is not that difficult.\n",
    "Consider the sentence:\n",
    "\n",
    "    John thinks that Mary said it and did it\n",
    "\n",
    "We have a compound object sentence, with ``Mary said it`` and ``did it`` as coordinated components.\n",
    "The way this has been marked up in the BHSA database is as follows:\n",
    "\n",
    "``Mary said it``, clause with ``clause_constituent_relation``=``Objc``, ``mother``=``John thinks``(clause)\n",
    "\n",
    "``and did it``, clause with ``clause_constituent_relation``=``Coor``, ``mother``=``Mary said it``(clause)\n",
    "\n",
    "So the second coordinated clause is simply linked to the first coordinated clause.\n",
    "Restructuring means to create a parent for both coordinated clauses\n",
    "and treat both as sisters at the same hierarchical level.\n",
    "See the diagram.\n",
    "\n",
    "<img src=\"images/TreesCase2.png\"/>\n",
    "\n",
    "### Note on order\n",
    "When we add nodes to new parents, we let them occupy the sequential position\n",
    "among its new sisters that corresponds with the slot set ordering.\n",
    "\n",
    "### Note on discontinuity\n",
    "Sentences, clauses and phrases are not always continuous.\n",
    "Before restructuring it will not always be the case that if you\n",
    "walk the tree in pre-order, you will end up with the leaves (the words)\n",
    "in the same order as the original sentence.\n",
    "Restructuring generally improves that, because it often puts\n",
    "a node under a non-continuous parent object precisely at the location\n",
    "that corresponds with the a gap in the parent.\n",
    "\n",
    "However, there is no guarantee that every discontinuity will be resolved in this graceful manner.\n",
    "When we create the trees, we also output the list of slot numbers\n",
    "that you get when you walk the tree in pre-order.\n",
    "Whenever this list is not monotonic, there is an issue with the ordering.\n",
    "\n",
    "### Note on cycles\n",
    "If a mother points to itself or a descendant of itself, we have a cycle in the mother relationship.\n",
    "In these cases, the restructuring algorithm will disconnect a parent link\n",
    "without introducing a new link to the tree above it:\n",
    "a whole fragment of the tree becomes disconnected and will get lost.\n",
    "\n",
    "Sanity check 6 below reveals that this occurs in fact 4 times in the BHSA version 4\n",
    "(it occurred 13 times in the BHSA 3 version).\n",
    "We will exclude these trees from further processing.\n",
    "\n",
    "### Note on stretch\n",
    "If a mother points outside the sentence of the clause\n",
    "on which it is specified we have a case of stretch.\n",
    "This should not happen. Mothers may point outside their sentences,\n",
    "but not in the cases that trigger restructuring.\n",
    "Yet, the sanity checks below reveal that this does occur in some versions.\n",
    "We will exclude these cases from further processing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Customization\n",
    "\n",
    "There are several levels of customization.\n",
    "\n",
    "If you want to apply this notebook to other resources than the BHSA,\n",
    "you can specify the names of the relevant node types and features that the\n",
    "tree algorithm may uses for structuring the trees and restructuring them (if needed).\n",
    "\n",
    "If you want to change the tags of the nodes in the output, you can write a new function `getTag()`\n",
    "and pass that to the tree algorithm. See below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "import sys\n",
    "import os\n",
    "import collections\n",
    "import yaml\n",
    "\n",
    "from tf.fabric import Fabric\n",
    "from tf.core.helpers import formatMeta\n",
    "\n",
    "from tree import Tree\n",
    "import utils"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pipeline\n",
    "See [operation](https://github.com/ETCBC/pipeline/blob/master/README.md#operation)\n",
    "for how to run this script in the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "if \"SCRIPT\" not in locals():\n",
    "    SCRIPT = False\n",
    "    FORCE = True\n",
    "    CORE_NAME = \"bhsa\"\n",
    "    NAME = \"trees\"\n",
    "    VERSION = \"2021\"\n",
    "\n",
    "\n",
    "def stop(good=False):\n",
    "    if SCRIPT:\n",
    "        sys.exit(0 if good else 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook can run a lot of tests and create a lot of examples.\n",
    "However, when run in the pipeline, we only want to create the two `tree` features.\n",
    "\n",
    "So, further on, there will be quite a bit of code under the condition `not SCRIPT`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setting up the context: source file and target directories\n",
    "\n",
    "The conversion is executed in an environment of directories, so that sources, temp files and\n",
    "results are in convenient places and do not have to be shifted around."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "repoBase = os.path.expanduser(\"~/github/etcbc\")\n",
    "coreRepo = \"{}/{}\".format(repoBase, CORE_NAME)\n",
    "thisRepo = \"{}/{}\".format(repoBase, NAME)\n",
    "\n",
    "coreTf = \"{}/tf/{}\".format(coreRepo, VERSION)\n",
    "\n",
    "thisTemp = \"{}/_temp/{}\".format(thisRepo, VERSION)\n",
    "thisTempTf = \"{}/tf\".format(thisTemp)\n",
    "\n",
    "thisTf = \"{}/tf/{}\".format(thisRepo, VERSION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Test\n",
    "\n",
    "Check whether this conversion is needed in the first place.\n",
    "Only when run as a script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "if SCRIPT:\n",
    "    (good, work) = utils.mustRun(\n",
    "        None, \"{}/.tf/{}.tfx\".format(thisTf, \"tree\"), force=FORCE\n",
    "    )\n",
    "    if not good:\n",
    "        stop(good=False)\n",
    "    if not work:\n",
    "        stop(good=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load the TF data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "..............................................................................................\n",
      ".       0.00s Load the existing TF dataset                                                   .\n",
      "..............................................................................................\n",
      "This is Text-Fabric 9.1.7\n",
      "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n",
      "\n",
      "114 features found and 0 ignored\n"
     ]
    }
   ],
   "source": [
    "utils.caption(4, \"Load the existing TF dataset\")\n",
    "TF = Fabric(locations=coreTf, modules=[\"\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "# Load data\n",
    "We load the some features of the\n",
    "[BHSA](https://github.com/etcbc/bhsa) data.\n",
    "See the [feature documentation](https://etcbc.github.io/bhsa/features/hebrew/2017/0_home.html) for more info."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "sp = \"part_of_speech\" if VERSION == \"3\" else \"sp\"\n",
    "rela = \"clause_constituent_relation\" if VERSION == \"3\" else \"rela\"\n",
    "ptyp = \"phrase_type\" if VERSION == \"3\" else \"typ\"\n",
    "ctyp = \"clause_atom_type\" if VERSION == \"3\" else \"typ\"\n",
    "g_word_utf8 = \"text\" if VERSION == \"3\" else \"g_word_utf8\"\n",
    "# -"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  0.00s loading features ...\n",
      "   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API\n",
      "    11s All features loaded/computed - for details use TF.isLoaded()\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('Computed',\n",
       "  'computed-data',\n",
       "  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),\n",
       " ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),\n",
       " ('Fabric', 'loading', ('TF',)),\n",
       " ('Locality', 'locality', ('L Locality',)),\n",
       " ('Nodes', 'navigating-nodes', ('N Nodes',)),\n",
       " ('Features',\n",
       "  'node-features',\n",
       "  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),\n",
       " ('Search', 'search', ('S Search',)),\n",
       " ('Text', 'text', ('T Text',))]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "api = TF.load(\n",
    "    f\"\"\"\n",
    "    {sp} {rela} {ptyp} {ctyp}\n",
    "    {g_word_utf8}\n",
    "    mother\n",
    "\"\"\"\n",
    ")\n",
    "api.makeAvailableIn(globals())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to make convenient labels for constituents, words and clauses, based on the\n",
    "the types of textual objects and the features\n",
    "`sp` and `rela`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Node types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "typeInfo = (\n",
    "    (\"word\", \"\"),\n",
    "    (\"subphrase\", \"U\"),\n",
    "    (\"phrase\", \"P\"),\n",
    "    (\"clause\", \"C\"),\n",
    "    (\"sentence\", \"S\"),\n",
    ")\n",
    "typeTable = dict(t for t in typeInfo)\n",
    "typeOrder = [t[0] for t in typeInfo]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part of speech"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('adjv', 10141),\n",
       " ('advb', 4603),\n",
       " ('art', 30387),\n",
       " ('conj', 62737),\n",
       " ('inrg', 1303),\n",
       " ('intj', 1912),\n",
       " ('nega', 6059),\n",
       " ('nmpr', 35607),\n",
       " ('prde', 2678),\n",
       " ('prep', 73298),\n",
       " ('prin', 1026),\n",
       " ('prps', 5035),\n",
       " ('subs', 125583),\n",
       " ('verb', 75451)]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted(Fs(sp).freqList(), key=lambda x: x[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "posTable = {\n",
    "    \"adjv\": \"aj\",\n",
    "    \"adjective\": \"aj\",\n",
    "    \"advb\": \"av\",\n",
    "    \"adverb\": \"av\",\n",
    "    \"art\": \"dt\",\n",
    "    \"article\": \"dt\",\n",
    "    \"conj\": \"cj\",\n",
    "    \"conjunction\": \"cj\",\n",
    "    \"inrg\": \"ir\",\n",
    "    \"interrogative\": \"ir\",\n",
    "    \"intj\": \"ij\",\n",
    "    \"interjection\": \"ij\",\n",
    "    \"nega\": \"ng\",\n",
    "    \"negative\": \"ng\",\n",
    "    \"nmpr\": \"n-pr\",\n",
    "    \"pronoun\": \"pr\",\n",
    "    \"prde\": \"pr-dem\",\n",
    "    \"prep\": \"pp\",\n",
    "    \"preposition\": \"pp\",\n",
    "    \"prin\": \"pr-int\",\n",
    "    \"prps\": \"pr-ps\",\n",
    "    \"subs\": \"n\",\n",
    "    \"noun\": \"n\",\n",
    "    \"verb\": \"vb\",\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `rela`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Adju', 6426),\n",
       " ('Appo', 5884),\n",
       " ('Attr', 5811),\n",
       " ('Cmpl', 298),\n",
       " ('Coor', 3660),\n",
       " ('Link', 1317),\n",
       " ('NA', 630059),\n",
       " ('Objc', 1327),\n",
       " ('Para', 1559),\n",
       " ('PrAd', 308),\n",
       " ('PreC', 162),\n",
       " ('ReVo', 312),\n",
       " ('Resu', 1666),\n",
       " ('RgRc', 856),\n",
       " ('Sfxs', 75),\n",
       " ('Spec', 5565),\n",
       " ('Subj', 506),\n",
       " ('adj', 4138),\n",
       " ('atr', 3064),\n",
       " ('dem', 1847),\n",
       " ('mod', 941),\n",
       " ('par', 11946),\n",
       " ('rec', 34989)]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted(Fs(rela).freqList(), key=lambda x: x[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "ccrInfo = {\n",
    "    \"Adju\": (\"r\", \"Cadju\"),\n",
    "    \"Appo\": (\"r\", \"Cappo\"),\n",
    "    \"Attr\": (\"r\", \"Cattr\"),\n",
    "    \"Cmpl\": (\"r\", \"Ccmpl\"),\n",
    "    \"Coor\": (\"x\", \"Ccoor\"),\n",
    "    \"CoVo\": (\"n\", \"Ccovo\"),\n",
    "    \"Link\": (\"r\", \"Clink\"),\n",
    "    \"Objc\": (\"r\", \"Cobjc\"),\n",
    "    \"Para\": (\"r\", \"Cpara\"),\n",
    "    \"PrAd\": (\"r\", \"Cprad\"),\n",
    "    \"PreC\": (\"r\", \"Cprec\"),\n",
    "    \"Pred\": (\"r\", \"Cpred\"),\n",
    "    \"ReVo\": (\"n\", \"Crevo\"),\n",
    "    \"Resu\": (\"n\", \"Cresu\"),\n",
    "    \"RgRc\": (\"r\", \"Crgrc\"),\n",
    "    \"Sfxs\": (\"r\", \"Csfxs\"),\n",
    "    \"Spec\": (\"r\", \"Cspec\"),\n",
    "    \"Subj\": (\"r\", \"Csubj\"),\n",
    "    \"NA\": (\"n\", \"C\"),\n",
    "    \"none\": (\"n\", \"C\"),\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "treeTypes = (\"sentence\", \"clause\", \"phrase\", \"subphrase\", \"word\")\n",
    "(rootType, leafType, clauseType, phraseType) = (\n",
    "    treeTypes[0],\n",
    "    treeTypes[-1],\n",
    "    treeTypes[1],\n",
    "    treeTypes[2],\n",
    ")\n",
    "ccrTable = dict((c[0], c[1][1]) for c in ccrInfo.items())\n",
    "ccrClass = dict((c[0], c[1][0]) for c in ccrInfo.items())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can actually construct the tree by initializing a tree object.\n",
    "After that we call its ``restructureClauses()`` method.\n",
    "\n",
    "Then we have two tree structures for each sentence:\n",
    "\n",
    "* the `etree`, i.e. the tree obtained by working out the embedding relationships and nothing else\n",
    "* the `rtree`, i.e. the tree obtained by restructuring the `etree`\n",
    "\n",
    "We have several tree relationships at our disposal:\n",
    "\n",
    "* `eparent` and its inverse `echildren`\n",
    "* `rparent` and its inverse `rchildren`\n",
    "* `eldest_sister` going from a mother clause of kind ``Coor`` (case 2) to its daughter clauses\n",
    "* *sisters* being the inverse of `eldest_sister`\n",
    "\n",
    "where `eldest_sister` and *sisters* only occur in the `rtree`.\n",
    "\n",
    "This will take a while (25 seconds approx on a MacBook Air 2012, 6 on a MacBook Pro in 2017, and now just 4 seconds)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  0.00s loading features ...\n",
      "  0.11s All additional features loaded - for details use TF.isLoaded()\n",
      "  0.11s Start computing parent and children relations for objects of type sentence, clause, phrase, subphrase, word\n",
      "  0.41s 100000 nodes\n",
      "  0.69s 200000 nodes\n",
      "  2.37s 300000 nodes\n",
      "  2.66s 400000 nodes\n",
      "  2.93s 500000 nodes\n",
      "  3.21s 600000 nodes\n",
      "  3.50s 700000 nodes\n",
      "  3.79s 800000 nodes\n",
      "  4.07s 900000 nodes\n",
      "  4.19s 945491 nodes: 881774 have parents and 518901 have children\n"
     ]
    }
   ],
   "source": [
    "tree = Tree(\n",
    "    TF,\n",
    "    otypes=treeTypes,\n",
    "    phraseType=phraseType,\n",
    "    clauseType=clauseType,\n",
    "    ccrFeature=rela,\n",
    "    ptFeature=ptyp,\n",
    "    posFeature=sp,\n",
    "    motherFeature=\"mother\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  4.20s Restructuring clauses: deep copying tree relations\n",
      "  5.91s Pass 0: Storing mother relationship\n",
      "  5.96s 20791 clauses have a mother\n",
      "  5.96s All clauses have mothers of types in {'word', 'subphrase', 'clause', 'phrase', 'sentence'}\n",
      "  5.96s Pass 1: all clauses except those of type Coor\n",
      "  6.01s Pass 2: clauses of type Coor only\n",
      "  6.04s Mothers applied. Found 0 motherless clauses.\n",
      "  6.04s 3233 nodes have 1 sisters\n",
      "  6.04s 200 nodes have 2 sisters\n",
      "  6.04s 9 nodes have 3 sisters\n",
      "  6.04s There are 3660 sisters, 3442 nodes have sisters.\n",
      "..............................................................................................\n",
      ".         17s Ready for processing                                                           .\n",
      "..............................................................................................\n"
     ]
    }
   ],
   "source": [
    "tree.restructureClauses(ccrClass)\n",
    "results = tree.relations()\n",
    "parent = results[\"rparent\"]\n",
    "sisters = results[\"sisters\"]\n",
    "children = results[\"rchildren\"]\n",
    "elderSister = results[\"elderSister\"]\n",
    "utils.caption(4, \"Ready for processing\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Sanity check\n",
    "# 6\n",
    "\n",
    "# If there are blocking errors we collect the nodes of those sentences in the set\n",
    "# `skip`, so that later on we can skip them easily.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "..............................................................................................\n",
      ".         17s Verifying whether all slots are preserved under restructuring                  .\n",
      "..............................................................................................\n",
      "..............................................................................................\n",
      ".         17s Expected mismatches: ??                                                        .\n",
      "..............................................................................................\n"
     ]
    }
   ],
   "source": [
    "utils.caption(4, \"Verifying whether all slots are preserved under restructuring\")\n",
    "expectedMismatches = {\n",
    "    \"3\": 13,\n",
    "    \"4\": 3,\n",
    "    \"4b\": 0,\n",
    "    \"2016\": 0,\n",
    "    \"2017\": 0,\n",
    "}\n",
    "utils.caption(4, \"Expected mismatches: {}\".format(expectedMismatches.get(VERSION, \"??\")))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "..............................................................................................\n",
      ".         19s 0 mismatches                                                                   .\n",
      "..............................................................................................\n"
     ]
    }
   ],
   "source": [
    "skip = set()\n",
    "errors = []\n",
    "for snode in F.otype.s(rootType):\n",
    "    declaredSlots = set(E.oslots.s(snode))\n",
    "    results = {}\n",
    "    thisgood = {}\n",
    "    for kind in (\"e\", \"r\"):\n",
    "        results[kind] = set(\n",
    "            lf for lf in tree.getLeaves(snode, kind) if F.otype.v(lf) == leafType\n",
    "        )\n",
    "        thisgood[kind] = declaredSlots == results[kind]\n",
    "        # if not thisgood[kind]:\n",
    "        #    print('{} D={}\\n  L={}'.format(kind, declaredSlots, results[kind]))\n",
    "        #    i -= 1\n",
    "    # if i == 0: break\n",
    "    if False in thisgood.values():\n",
    "        errors.append((snode, thisgood[\"e\"], thisgood[\"r\"]))\n",
    "nErrors = len(errors)\n",
    "if nErrors:\n",
    "    utils.caption(4, \"{} mismatches:\".format(len(errors)), good=False)\n",
    "    mine = min(20, len(errors))\n",
    "    skip |= {e[0] for e in errors}\n",
    "    for (s, e, r) in errors[0:mine]:\n",
    "        utils.caption(\n",
    "            4,\n",
    "            \"{} embedding: {}; restructd: {}\".format(\n",
    "                s, \"OK\" if e else \"XX\", \"OK\" if r else \"XX\"\n",
    "            ),\n",
    "            good=False,\n",
    "        )\n",
    "else:\n",
    "    utils.caption(4, \"0 mismatches\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Delivering Trees\n",
    "\n",
    "We are going to deliver the trees as a feature of sentence nodes.\n",
    "\n",
    "There will be two features in fact:\n",
    "\n",
    "* `tree`: minimalistic\n",
    "* `treen`: with node information of non-slot nodes\n",
    "\n",
    "In order to produce this, we need to write appropriate functions to pass to `writeTree`: `getTag()` and `getTagN()`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "## `getTag(node)`\n",
    "\n",
    "This function produces for each node\n",
    "\n",
    "* a tag string,\n",
    "* a part-of-speech representation,\n",
    "* a textual position (slot number),\n",
    "* a boolean which tells if this node is a leaf or not.\n",
    "\n",
    "This function will be passed to the `writeTree()` function in the `tree` module.\n",
    "By supplying a different function, you can control a lot of the characteristics of the\n",
    "written tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "def getTag(node):\n",
    "    otype = F.otype.v(node)\n",
    "    tag = typeTable[otype]\n",
    "    if tag == \"P\":\n",
    "        tag = Fs(ptyp).v(node)\n",
    "    elif tag == \"C\":\n",
    "        tag = ccrTable[Fs(rela).v(node)]\n",
    "    isWord = tag == \"\"\n",
    "    pos = posTable[Fs(sp).v(node)] if isWord else None\n",
    "    slot = node if isWord else None\n",
    "    text = '\"{}\"'.format(Fs(g_word_utf8).v(node)) if isWord else None\n",
    "    return (tag, pos, slot, text, isWord)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "This is a variant on `getTag()` where we put the node number into the tag, between `{ }`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def getTagN(node):\n",
    "    otype = F.otype.v(node)\n",
    "    tag = typeTable[otype]\n",
    "    if tag == \"P\":\n",
    "        tag = Fs(ptyp).v(node)\n",
    "    elif tag == \"C\":\n",
    "        tag = ccrTable[Fs(rela).v(node)]\n",
    "    isWord = tag == \"\"\n",
    "    if not isWord:\n",
    "        tag += \"{\" + str(node) + \"}\"\n",
    "    pos = posTable[Fs(sp).v(node)] if isWord else None\n",
    "    slot = node if isWord else None\n",
    "    text = '\"{}\"'.format(Fs(g_word_utf8).v(node)) if isWord else None\n",
    "    return (tag, pos, slot, text, isWord)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, here is the production of the whole set of trees.\n",
    "\n",
    "Now we generate the data for two TF features `tree` and `treen`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "..............................................................................................\n",
      ".         19s Exporting sentence trees to TF                                                 .\n",
      "..............................................................................................\n",
      "|         20s 10000 trees composed\n",
      "|         22s 20000 trees composed\n",
      "|         23s 30000 trees composed\n",
      "|         24s 40000 trees composed\n",
      "|         25s 50000 trees composed\n",
      "|         27s 60000 trees composed\n",
      "..............................................................................................\n",
      ".         27s 63717 trees composed                                                           .\n",
      "..............................................................................................\n"
     ]
    }
   ],
   "source": [
    "utils.caption(4, \"Exporting {} trees to TF\".format(rootType))\n",
    "s = 0\n",
    "chunk = 10000\n",
    "sc = 0\n",
    "treeData = {}\n",
    "treeDataN = {}\n",
    "for node in F.otype.s(rootType):\n",
    "    if node in skip:\n",
    "        continue\n",
    "    (treeRep, wordsRep, bSlot) = tree.writeTree(\n",
    "        node, \"r\", getTag, rev=False, leafNumbers=True\n",
    "    )\n",
    "    (treeNRep, wordsNRep, bSlotN) = tree.writeTree(\n",
    "        node, \"r\", getTagN, rev=False, leafNumbers=True\n",
    "    )\n",
    "    treeData[node] = treeRep\n",
    "    treeDataN[node] = treeNRep\n",
    "    s += 1\n",
    "    sc += 1\n",
    "    if sc == chunk:\n",
    "        utils.caption(0, \"{} trees composed\".format(s))\n",
    "        sc = 0\n",
    "utils.caption(4, \"{} trees composed\".format(s))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "genericMetaPath = f\"{thisRepo}/yaml/generic.yaml\"\n",
    "treesMetaPath = f\"{thisRepo}/yaml/trees.yaml\"\n",
    "\n",
    "with open(genericMetaPath) as fh:\n",
    "    genericMeta = yaml.load(fh, Loader=yaml.FullLoader)\n",
    "    genericMeta[\"version\"] = VERSION\n",
    "with open(treesMetaPath) as fh:\n",
    "    treesMeta = formatMeta(yaml.load(fh, Loader=yaml.FullLoader))\n",
    "\n",
    "metaData = {\"\": genericMeta, **treesMeta}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "nodeFeatures = dict(tree=treeData, treen=treeDataN)\n",
    "\n",
    "for f in nodeFeatures:\n",
    "    metaData[f][\"valueType\"] = \"str\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "..............................................................................................\n",
      ".      5m 01s Writing tree feature to TF                                                     .\n",
      "..............................................................................................\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "utils.caption(4, \"Writing tree feature to TF\")\n",
    "TFw = Fabric(locations=thisTempTf, silent=True)\n",
    "TFw.save(nodeFeatures=nodeFeatures, edgeFeatures={}, metaData=metaData)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Diffs\n",
    "\n",
    "Check differences with previous versions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "utils.checkDiffs(thisTempTf, thisTf, only=set(nodeFeatures))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deliver\n",
    "\n",
    "Copy the new TF features from the temporary location where they have been created to their final destination."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "In[38]:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "utils.deliverDataset(thisTempTf, thisTf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compile TF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "In[39]:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "utils.caption(4, \"Load and compile the new TF features\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "TF = Fabric(locations=[coreTf, thisTf], modules=[\"\"])\n",
    "api = TF.load(\" \".join(nodeFeatures))\n",
    "api.makeAvailableIn(globals())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "In[40]:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "utils.caption(4, \"Basic tests\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "utils.caption(4, \"Sample sentences in tree form\")\n",
    "sentences = F.otype.s(\"sentence\")\n",
    "examples = (sentences[0], sentences[len(sentences) // 2], sentences[-1])\n",
    "for s in examples:\n",
    "    utils.caption(0, F.tree.v(s))\n",
    "for s in examples:\n",
    "    utils.caption(0, F.treen.v(s))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# End of pipeline\n",
    "\n",
    "If this notebook is run with the purpose of generating data, this is the end then.\n",
    "\n",
    "After this tests and examples are run."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "In[41]:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if SCRIPT:\n",
    "    stop(good=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See the tutorial\n",
    "[trees](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/bhsa/trees.ipynb)\n",
    "for how to make use of this feature."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Checking for sanity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us see whether the trees we have constructed satisfy some sanity constraints.\n",
    "After all, the algorithm is based on certain assumptions about the data,\n",
    "but are those assumptions valid?\n",
    "And restructuring is a tricky operation, do we have confidence that nothing went wrong?\n",
    "\n",
    "1. How many sentence nodes? From earlier queries we know what to expect.\n",
    "1. Does any sentence have a parent?\n",
    "   If so, there is something wrong with our assumptions or algorithm.\n",
    "1. Is every top node a sentence?\n",
    "   If not, we have material outside a sentence, which contradicts the assumptions.\n",
    "1. Do you reach all sentences if you go up from words?\n",
    "   If not, some sentences do not contain words.\n",
    "1. Do you reach all words if you go down from sentences?\n",
    "   If not, some words have become disconnected from their sentences.\n",
    "1. Do you reach the same words in reconstructed trees as in embedded trees?\n",
    "   If not, some sentence material has got lost during the restructuring process.\n",
    "1. From what object types to what object types does the parent relationship link?\n",
    "   Here we check that parents do not link object types\n",
    "   that are too distant in the object type ranking.\n",
    "1. How many nodes have mothers and how many mothers can a node have?\n",
    "   We expect at most one.\n",
    "1. From what object types to what object types does the mother relationship link?\n",
    "1. Is the mother of a clause always in the same sentence?\n",
    "   If not, foreign sentences will be drawn in, leading to (very) big chunks.\n",
    "   This may occur when we use mother relationships in cases where\n",
    "   `rela` has different values than the ones that should trigger restructuring.\n",
    "1. Has the max/average tree depth increased after restructuring?\n",
    "   By how much? This is meant as an indication by how much\n",
    "   our tree structures improve in significant hierarchy\n",
    "   when we take the mother relationship into account."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    14s Counting sentences ... (expecting 63711)\n",
      "    14s There are 63711 sentences\n"
     ]
    }
   ],
   "source": [
    "# 1\n",
    "expectedSentences = {\n",
    "    \"3\": 71354,\n",
    "    \"4\": 66045,\n",
    "    \"4b\": 63586,\n",
    "    \"2016\": 63570,\n",
    "    \"2017\": 63711,\n",
    "}\n",
    "TF.info(\n",
    "    \"Counting {}s ... (expecting {})\".format(\n",
    "        rootType, expectedSentences.get(VERSION, \"??\")\n",
    "    )\n",
    ")\n",
    "TF.info(\"There are {} {}s\".format(len(list(F.otype.s(rootType))), rootType))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    15s Checking parents of sentences ... (expecting none)\n",
      "    15s No sentence has a parent\n"
     ]
    }
   ],
   "source": [
    "# 2\n",
    "TF.info(\"Checking parents of {}s ... (expecting none)\".format(rootType))\n",
    "exceptions = set()\n",
    "for node in F.otype.s(rootType):\n",
    "    if node in parent:\n",
    "        exceptions.add(node)\n",
    "if len(exceptions) == 0:\n",
    "    TF.info(\"No {} has a parent\".format(rootType))\n",
    "else:\n",
    "    TF.error(\"{} {}s have a parent:\".format(len(exceptions), rootType))\n",
    "    for n in sorted(exceptions):\n",
    "        p = parent[n]\n",
    "        TF.error(\n",
    "            \"{} {} [{}] has {} parent {} [{}]\".format(\n",
    "                rootType, n, tree.slotss(n), F.otype.v(p), p, tree.slotss(p)\n",
    "            )\n",
    "        )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    15s Checking the types of root nodes ... (should all be sentences)\n",
      "    15s Expected roots which are non-sentences: 0\n",
      "    16s 63711 sentences seen\n",
      "    16s All top nodes are sentences\n"
     ]
    }
   ],
   "source": [
    "# 3 (again a check on #1)\n",
    "TF.info(\"Checking the types of root nodes ... (should all be {}s)\".format(rootType))\n",
    "expectedTops = {\n",
    "    \"3\": 0,\n",
    "    \"4\": \"3 subphrases\",\n",
    "    \"4b\": 0,\n",
    "    \"2016\": 0,\n",
    "    \"2017\": 0,\n",
    "}\n",
    "TF.info(\n",
    "    \"Expected roots which are non-{}s: {}\".format(\n",
    "        rootType, expectedTops.get(VERSION, \"??\")\n",
    "    )\n",
    ")\n",
    "exceptions = collections.defaultdict(lambda: [])\n",
    "sn = 0\n",
    "for node in N.walk():\n",
    "    otype = F.otype.v(node)\n",
    "    if otype not in typeTable:\n",
    "        continue\n",
    "    if otype == rootType:\n",
    "        sn += 1\n",
    "    if node not in parent and node not in elderSister and otype != rootType:\n",
    "        exceptions[otype].append(node)\n",
    "TF.info(\"{} {}s seen\".format(sn, rootType))\n",
    "\n",
    "if len(exceptions) == 0:\n",
    "    TF.info(\"All top nodes are {}s\".format(rootType))\n",
    "else:\n",
    "    TF.error(\"Top nodes which are not {}s:\".format(rootType))\n",
    "    for t in sorted(exceptions):\n",
    "        TF.error(\"{}: {}x\".format(t, len(exceptions[t])), tm=False)\n",
    "\n",
    "for c in exceptions[clauseType]:\n",
    "    (s, st) = tree.getRoot(c, \"e\")\n",
    "    v = T.sectionFromNode(s)\n",
    "    TF.error(\n",
    "        \"{}={}, {}={}={}, verse={}\".format(clauseType, c, rootType, st, s, v), tm=False\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    16s Embedding trees\n",
      "    16s Starting from word nodes ...\n",
      "From 426584 word nodes reached 63711 sentence nodes\n",
      "    17s Starting from sentence nodes ...\n",
      "From 63711 sentence nodes reached 426584 word nodes\n",
      "    18s Restructd trees\n",
      "    18s Starting from word nodes ...\n",
      "From 426584 word nodes reached 63711 sentence nodes\n",
      "    18s Starting from sentence nodes ...\n",
      "From 63711 sentence nodes reached 425419 word nodes\n",
      "    19s Done\n"
     ]
    }
   ],
   "source": [
    "# 4, 5\n",
    "def getTop(kind, rel, rela, multi):\n",
    "    seen = set()\n",
    "    topNodes = set()\n",
    "    startNodes = set(F.otype.s(kind))\n",
    "    nextNodes = startNodes\n",
    "    TF.info(\"Starting from {} nodes ...\".format(kind))\n",
    "    while len(nextNodes):\n",
    "        newNextNodes = set()\n",
    "        for node in nextNodes:\n",
    "            if node in seen:\n",
    "                continue\n",
    "            seen.add(node)\n",
    "            isTop = True\n",
    "            if node in rel:\n",
    "                isTop = False\n",
    "                if multi:\n",
    "                    for c in rel[node]:\n",
    "                        newNextNodes.add(c)\n",
    "                else:\n",
    "                    newNextNodes.add(rel[node])\n",
    "            if node in rela:\n",
    "                isTop = False\n",
    "                if multi:\n",
    "                    for c in rela[node]:\n",
    "                        newNextNodes.add(c)\n",
    "                else:\n",
    "                    newNextNodes.add(rela[node])\n",
    "            if isTop:\n",
    "                topNodes.add(node)\n",
    "        nextNodes = newNextNodes\n",
    "    topTypes = collections.defaultdict(lambda: 0)\n",
    "    for t in topNodes:\n",
    "        topTypes[F.otype.v(t)] += 1\n",
    "    for t in topTypes:\n",
    "        TF.info(\n",
    "            \"From {} {} nodes reached {} {} nodes\".format(\n",
    "                len(startNodes), kind, topTypes[t], t\n",
    "            ),\n",
    "            tm=False,\n",
    "        )\n",
    "\n",
    "\n",
    "TF.info(\"Embedding trees\")\n",
    "getTop(leafType, tree.eparent, {}, False)\n",
    "getTop(rootType, tree.echildren, {}, True)\n",
    "TF.info(\"Restructd trees\")\n",
    "getTop(leafType, tree.rparent, tree.elderSister, False)\n",
    "getTop(rootType, tree.rchildren, tree.sisters, True)\n",
    "TF.info(\"Done\")\n",
    "\n",
    "\n",
    "# 7\n",
    "TF.info(\"Which types embed which types and how often? ...\")\n",
    "for kind in (\"e\", \"r\"):\n",
    "    pLinkedTypes = collections.defaultdict(lambda: 0)\n",
    "    parent = tree.eparent if kind == \"e\" else tree.rparent\n",
    "    kindRep = \"embedding\" if kind == \"e\" else \"restructd\"\n",
    "    for (c, p) in parent.items():\n",
    "        pLinkedTypes[(F.otype.v(c), F.otype.v(p))] += 1\n",
    "    TF.info(\"Found {} parent ({}) links between types\".format(len(parent), kindRep))\n",
    "    for lt in sorted(pLinkedTypes):\n",
    "        TF.info(\"{}: {}x\".format(lt, pLinkedTypes[lt]), tm=False)\n",
    "\n",
    "# 8\n",
    "TF.info(\"How many mothers can nodes have? ...\")\n",
    "motherLen = {}\n",
    "for c in N.walk():\n",
    "    lms = list(E.mother.f(c))\n",
    "    nms = len(lms)\n",
    "    if nms:\n",
    "        motherLen[c] = nms\n",
    "count = collections.defaultdict(lambda: 0)\n",
    "for c in tree.mother:\n",
    "    count[motherLen[c]] += 1\n",
    "TF.info(\"There are {} tree nodes with a mother\".format(len(tree.mother)))\n",
    "for cnt in sorted(count):\n",
    "    TF.info(\n",
    "        \"{} nodes have {} mother{}\".format(count[cnt], cnt, \"s\" if cnt != 1 else \"\"),\n",
    "        tm=False,\n",
    "    )\n",
    "\n",
    "# 9\n",
    "TF.info(\"Which types have mother links to which types and how often? ...\")\n",
    "mLinkedTypes = collections.defaultdict(lambda: set())\n",
    "for (c, m) in tree.mother.items():\n",
    "    ctype = F.otype.v(c)\n",
    "    mLinkedTypes[(ctype, Fs(rela).v(c), F.otype.v(m))].add(c)\n",
    "TF.info(\"Found {} mother links between types\".format(len(parent)))\n",
    "for lt in sorted(mLinkedTypes):\n",
    "    TF.info(\"{}: {}x\".format(lt, len(mLinkedTypes[lt])), tm=False)\n",
    "\n",
    "# 10\n",
    "TF.info(\"Counting {}s with mothers in another {}\".format(clauseType, rootType))\n",
    "expectedOther = {\n",
    "    \"3\": 2,\n",
    "    \"4\": 0,\n",
    "    \"4b\": 0,\n",
    "    \"2016\": 0,\n",
    "    \"2017\": 0,\n",
    "}\n",
    "TF.info(\n",
    "    \"Expecting {} {}s with mothers in another {}\".format(\n",
    "        expectedOther.get(VERSION, \"??\"), clauseType, rootType\n",
    "    )\n",
    ")\n",
    "exceptions = set()\n",
    "for node in tree.mother:\n",
    "    if F.otype.v(node) not in typeTable:\n",
    "        continue\n",
    "    mNode = tree.mother[node]\n",
    "    sNode = tree.getRoot(node, \"e\")\n",
    "    smNode = tree.getRoot(mNode, \"e\")\n",
    "    if sNode != smNode:\n",
    "        exceptions.add((node, sNode, smNode))\n",
    "TF.info(\"{} nodes have a mother in another {}\".format(len(exceptions), rootType))\n",
    "for (n, sn, smn) in exceptions:\n",
    "    TF.error(\n",
    "        \"[{} {}]({}) occurs in {} but has mother in {}\".format(\n",
    "            F.otype.v(n), tree.slotss(n), n, sn, smn\n",
    "        ),\n",
    "        tm=False,\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    28s Computing lengths and depths\n",
      "    30s 63711 trees seen, of which in 10904 cases restructuring makes a difference in depth\n",
      "    30s Embedding trees: max depth =  9, average depth = 3.6\n",
      "    30s Restructd trees: max depth = 24, average depth = 3.9\n",
      "    30s Statistics for cases where restructuring makes a difference:\n",
      "    30s Embedding trees: max depth =  8, average depth = 3.7\n",
      "    30s Restructd trees: max depth = 24, average depth = 5.4\n",
      "    30s Total number of leaves in the trees: 426584, average number of leaves = 6.7\n"
     ]
    }
   ],
   "source": [
    "# 11\n",
    "TF.info(\"Computing lengths and depths\")\n",
    "nTrees = 0\n",
    "rnTrees = 0\n",
    "totalDepth = {\"e\": 0, \"r\": 0}\n",
    "rTotalDepth = {\"e\": 0, \"r\": 0}\n",
    "maxDepth = {\"e\": 0, \"r\": 0}\n",
    "rMaxDepth = {\"e\": 0, \"r\": 0}\n",
    "totalLength = 0\n",
    "\n",
    "for node in F.otype.s(rootType):\n",
    "    nTrees += 1\n",
    "    totalLength += tree.length(node)\n",
    "    thisDepth = {}\n",
    "    for kind in (\"e\", \"r\"):\n",
    "        thisDepth[kind] = tree.depth(node, kind)\n",
    "    different = thisDepth[\"e\"] != thisDepth[\"r\"]\n",
    "    if different:\n",
    "        rnTrees += 1\n",
    "    for kind in (\"e\", \"r\"):\n",
    "        if thisDepth[kind] > maxDepth[kind]:\n",
    "            maxDepth[kind] = thisDepth[kind]\n",
    "        totalDepth[kind] += thisDepth[kind]\n",
    "        if different:\n",
    "            if thisDepth[kind] > rMaxDepth[kind]:\n",
    "                rMaxDepth[kind] = thisDepth[kind]\n",
    "            rTotalDepth[kind] += thisDepth[kind]\n",
    "\n",
    "TF.info(\n",
    "    \"{} trees seen, of which in {} cases restructuring makes a difference in depth\".format(\n",
    "        nTrees,\n",
    "        rnTrees,\n",
    "    )\n",
    ")\n",
    "if nTrees > 0:\n",
    "    TF.info(\n",
    "        \"Embedding trees: max depth = {:>2}, average depth = {:.2g}\".format(\n",
    "            maxDepth[\"e\"],\n",
    "            totalDepth[\"e\"] / nTrees,\n",
    "        )\n",
    "    )\n",
    "    TF.info(\n",
    "        \"Restructd trees: max depth = {:>2}, average depth = {:.2g}\".format(\n",
    "            maxDepth[\"r\"],\n",
    "            totalDepth[\"r\"] / nTrees,\n",
    "        )\n",
    "    )\n",
    "if rnTrees > 0:\n",
    "    TF.info(\"Statistics for cases where restructuring makes a difference:\")\n",
    "    TF.info(\n",
    "        \"Embedding trees: max depth = {:>2}, average depth = {:.2g}\".format(\n",
    "            rMaxDepth[\"e\"],\n",
    "            rTotalDepth[\"e\"] / rnTrees,\n",
    "        )\n",
    "    )\n",
    "    TF.info(\n",
    "        \"Restructd trees: max depth = {:>2}, average depth = {:.2g}\".format(\n",
    "            rMaxDepth[\"r\"],\n",
    "            rTotalDepth[\"r\"] / rnTrees,\n",
    "        )\n",
    "    )\n",
    "TF.info(\n",
    "    \"Total number of leaves in the trees: {}, average number of leaves = {:.2g}\".format(\n",
    "        totalLength,\n",
    "        totalLength / nTrees,\n",
    "    )\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}