{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9a869eb5-7a9e-4e22-b933-7bdbfdc6974a",
   "metadata": {},
   "source": [
    "# Weight calculation PCFG model (GBI treebank/ N1904GBI)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1e404b3-63ba-4fc6-a5be-0773b6cc1412",
   "metadata": {},
   "source": [
    "## Table of content <a class=\"anchor\" id=\"TOC\"></a>\n",
    "* <a href=\"#bullet1\">1 - Introduction</a>\n",
    "* <a href=\"#bullet2\">2 - Create sum of transitions</a>\n",
    "* <a href=\"#bullet3\">3 - Avarage probabilities for the complete set</a>\n",
    "* <a href=\"#bullet4\">4 - Normalizing probabilities per source status</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd612dc1-0624-4739-86bc-0f646c590d7b",
   "metadata": {},
   "source": [
    "# 1 - Introduction <a class=\"anchor\" id=\"bullet1\"></a>\n",
    "##### [Back to TOC](#TOC)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6aa9f28-7c84-4dc6-b5a0-2c507b395e94",
   "metadata": {},
   "source": [
    "PCFG= Probabilistic Context-Free Grammar. It is a type of context-free grammar that associates a probability with each production rule. Each production rule in a PCFG is assigned a probability, indicating the likelihood of using that rule in a derivation.\n",
    "\n",
    "The formula for calculation probability of transtition $\\alpha → \\beta$:\n",
    "\n",
    "$q_{ML}(\\alpha → \\beta) =\\frac{count (\\alpha → \\beta)}{count (\\alpha)}$\n",
    "\n",
    "And consequently:\n",
    "\n",
    "&sum;$_{i=1}^{n}  q_{ML}(\\alpha → \\beta) = 1 $\n",
    "\n"
   ]
  },
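  {
   "cell_type": "markdown",
   "id": "worked-example-qml-per-source",
   "metadata": {},
   "source": [
    "A worked instance may make the formula concrete. It uses the counts reported in section 4 for the Johannine letters and assumes that $count(\\alpha)$ is the number of times $\\alpha$ occurs as a parent node. The source category $S$ occurs 403 times as a parent: 223 times as $S → np$ and 180 times as $S → CL$. Hence:\n",
    "\n",
    "$q_{ML}(S → np) = \\frac{223}{403} \\approx 0.5533$\n",
    "\n",
    "$q_{ML}(S → CL) = \\frac{180}{403} \\approx 0.4467$\n",
    "\n",
    "The two estimates sum to 1, as required for transitions that share the same source."
   ]
  },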
  {
   "cell_type": "markdown",
   "id": "4ed069ce-4af6-40c9-bdb6-2737f8742fda",
   "metadata": {},
   "source": [
    "Testing dataset: N1904 treebank (GBI)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c38fa20-5e1a-44d5-98f4-34d62d42c0ae",
   "metadata": {},
   "source": [
    "# 2 - Create sum of transitions <a class=\"anchor\" id=\"bullet2\"></a>\n",
    "##### [Back to TOC](#TOC)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d2024bb2-4728-4810-abfd-726499c74430",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import sys\n",
    "import os\n",
    "import time\n",
    "import pickle\n",
    "\n",
    "import re  # used for regular expressions\n",
    "from os import listdir\n",
    "from os.path import isfile, join\n",
    "import xml.etree.ElementTree as ET"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "581c1806-99b1-42ec-874a-fa7b3cd97086",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "BaseDir = 'C:\\\\Users\\\\tonyj\\\\my_new_Jupyter_folder\\\\test_of_xml_etree\\\\'\n",
    "InputDir = BaseDir+'inputfiles\\\\'\n",
    "bo='26-jude'\n",
    "InputFile = os.path.join(InputDir, f'{bo}.xml')\n",
    "tree = ET.parse(InputFile)\n",
    "root = tree.getroot()\n",
    "\n",
    "# Dictionary to store transition frequencies\n",
    "transition_frequencies = {}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d13c405-9c2c-46ee-bd83-59bb2618eca6",
   "metadata": {},
   "source": [
    "Multiple sets of books are defined here allowing for comparing the calculated probability-values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "a544f78c-6be8-4a13-b26d-f17eecfea8af",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "booklist = ['01-matthew', '02-mark', '03-luke', '04-john', '05-acts', '06-romans',\n",
    "           '07-1corinthians','08-2corinthians', '09-galatians', '10-ephesians',\n",
    "           '11-philippians', '12-colossians', '13-1thessalonians', '14-2thessalonians',\n",
    "           '15-1timothy', '16-2timothy', '17-titus', '18-philemon', '19-hebrews', \n",
    "           '20-james', '21-1peter', '22-2peter', '23-1john', '24-2john', '25-3john',\n",
    "           '26-jude', '27-revelation']\n",
    "paullist= ['06-romans', '07-1corinthians','08-2corinthians', '09-galatians', '10-ephesians',\n",
    "           '11-philippians', '12-colossians', '13-1thessalonians', '14-2thessalonians',\n",
    "           '15-1timothy', '16-2timothy', '17-titus', '18-philemon']\n",
    "peterlist= ['21-1peter', '22-2peter']\n",
    "lukelist= ['03-luke','05-acts']\n",
    "johnlist = ['23-1john', '24-2john', '25-3john']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d78a1c6-e381-4220-b726-6b013c44bbc8",
   "metadata": {},
   "source": [
    "# 3 - Avarage probabilities for the complete set <a class=\"anchor\" id=\"bullet3\"></a>\n",
    "##### [Back to TOC](#TOC)\n",
    "\n",
    "i.e. all rules sum op to p=1."
   ]
  },
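  {
   "cell_type": "markdown",
   "id": "worked-example-global-weight",
   "metadata": {},
   "source": [
    "As a rough illustration with the Pauline letters processed in the next cell: the transition $S → np$ occurs 2285 times among 95065 transitions in total, so its weight in this section is\n",
    "\n",
    "$\\frac{2285}{95065} \\approx 0.02404$\n",
    "\n",
    "This differs from the per-source estimate $q_{ML}$ of the introduction. Dividing by the number of transitions that start from $S$ ($1929 + 2285 + 4 = 4218$) instead would give\n",
    "\n",
    "$\\frac{2285}{4218} \\approx 0.5417$\n",
    "\n",
    "which corresponds to the per-source normalization applied in section 4."
   ]
  },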
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "10c4098e-3ced-4fdb-9cc7-d7771ba16dea",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\06-romans.xml\n",
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\07-1corinthians.xml\n",
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\08-2corinthians.xml\n",
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\09-galatians.xml\n",
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\10-ephesians.xml\n",
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\11-philippians.xml\n",
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\12-colossians.xml\n",
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\13-1thessalonians.xml\n",
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\14-2thessalonians.xml\n",
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\15-1timothy.xml\n",
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\16-2timothy.xml\n",
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\17-titus.xml\n",
      "Reading file C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\inputfiles\\18-philemon.xml\n",
      "number of transitions: 95065\n",
      "Transition table for starting condition: S\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "S\tCL\t1929\t0.02029\n",
      "S\tnp\t2285\t0.02404\n",
      "S\tadjp\t4\t4.208e-05\n",
      "\n",
      "\n",
      "Transition table for starting condition: CL\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "CL\tS\t2299\t0.02418\n",
      "CL\tV\t4816\t0.05066\n",
      "CL\tADV\t4784\t0.05032\n",
      "CL\tO\t2271\t0.02389\n",
      "CL\tVC\t637\t0.006701\n",
      "CL\tP\t1115\t0.01173\n",
      "CL\tCL\t7937\t0.08349\n",
      "CL\tIO\t406\t0.004271\n",
      "CL\tTerm\t3410\t0.03587\n",
      "CL\tconj\t56\t0.0005891\n",
      "CL\tnp\t148\t0.001557\n",
      "CL\tintj\t14\t0.0001473\n",
      "CL\tadvp\t136\t0.001431\n",
      "CL\tO2\t57\t0.0005996\n",
      "CL\tptcl\t37\t0.0003892\n",
      "\n",
      "\n",
      "Transition table for starting condition: np\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "np\tnp\t11942\t0.1256\n",
      "np\tTerm\t15789\t0.1661\n",
      "np\tadjp\t1927\t0.02027\n",
      "np\tCL\t955\t0.01005\n",
      "np\tadvp\t301\t0.003166\n",
      "np\tpp\t285\t0.002998\n",
      "np\tconj\t5\t5.26e-05\n",
      "np\tnump\t16\t0.0001683\n",
      "np\tintj\t2\t2.104e-05\n",
      "\n",
      "\n",
      "Transition table for starting condition: adjp\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "adjp\tTerm\t2378\t0.02501\n",
      "adjp\tCL\t113\t0.001189\n",
      "adjp\tadj\t44\t0.0004628\n",
      "adjp\tadjp\t197\t0.002072\n",
      "adjp\tpp\t9\t9.467e-05\n",
      "adjp\tadvp\t20\t0.0002104\n",
      "adjp\tnp\t9\t9.467e-05\n",
      "\n",
      "\n",
      "Transition table for starting condition: V\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "V\tvp\t4816\t0.05066\n",
      "\n",
      "\n",
      "Transition table for starting condition: vp\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "vp\tTerm\t5618\t0.0591\n",
      "vp\tvp\t165\t0.001736\n",
      "vp\tCL\t23\t0.0002419\n",
      "vp\tadvp\t7\t7.363e-05\n",
      "\n",
      "\n",
      "Transition table for starting condition: ADV\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "ADV\tpp\t2365\t0.02488\n",
      "ADV\tadjp\t69\t0.0007258\n",
      "ADV\tadvp\t1260\t0.01325\n",
      "ADV\tCL\t479\t0.005039\n",
      "ADV\tnp\t622\t0.006543\n",
      "ADV\tADV\t20\t0.0002104\n",
      "ADV\tTerm\t7\t7.363e-05\n",
      "\n",
      "\n",
      "Transition table for starting condition: pp\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "pp\tTerm\t3102\t0.03263\n",
      "pp\tnp\t3039\t0.03197\n",
      "pp\tadvp\t76\t0.0007995\n",
      "pp\tpp\t322\t0.003387\n",
      "pp\tprep\t42\t0.0004418\n",
      "\n",
      "\n",
      "Transition table for starting condition: O\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "O\tnp\t2011\t0.02115\n",
      "O\tCL\t259\t0.002724\n",
      "O\tadjp\t1\t1.052e-05\n",
      "\n",
      "\n",
      "Transition table for starting condition: VC\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "VC\tvp\t637\t0.006701\n",
      "\n",
      "\n",
      "Transition table for starting condition: P\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "P\tpp\t221\t0.002325\n",
      "P\tnp\t492\t0.005175\n",
      "P\tCL\t19\t0.0001999\n",
      "P\tadjp\t352\t0.003703\n",
      "P\tadvp\t31\t0.0003261\n",
      "\n",
      "\n",
      "Transition table for starting condition: advp\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "advp\tTerm\t1792\t0.01885\n",
      "advp\tadvp\t72\t0.0007574\n",
      "advp\tadjp\t20\t0.0002104\n",
      "advp\tnp\t27\t0.000284\n",
      "advp\tadv\t39\t0.0004102\n",
      "\n",
      "\n",
      "Transition table for starting condition: IO\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "IO\tnp\t406\t0.004271\n",
      "\n",
      "\n",
      "Transition table for starting condition: conj\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "conj\tTerm\t61\t0.0006417\n",
      "\n",
      "\n",
      "Transition table for starting condition: adj\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "adj\tTerm\t44\t0.0004628\n",
      "\n",
      "\n",
      "Transition table for starting condition: prep\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "prep\tTerm\t42\t0.0004418\n",
      "\n",
      "\n",
      "Transition table for starting condition: intj\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "intj\tTerm\t16\t0.0001683\n",
      "\n",
      "\n",
      "Transition table for starting condition: O2\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "O2\tadjp\t14\t0.0001473\n",
      "O2\tnp\t39\t0.0004102\n",
      "O2\tCL\t4\t4.208e-05\n",
      "\n",
      "\n",
      "Transition table for starting condition: adv\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "adv\tTerm\t39\t0.0004102\n",
      "\n",
      "\n",
      "Transition table for starting condition: ptcl\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "ptcl\tTerm\t37\t0.0003892\n",
      "\n",
      "\n",
      "Transition table for starting condition: nump\n",
      "From\tTo\tTransitions\tAverage Occurrence\n",
      "nump\tTerm\t19\t0.0001999\n",
      "nump\tnump\t3\t3.156e-05\n",
      "nump\tadjp\t3\t3.156e-05\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "def addParentInfo(parent, element):\n",
    "    for child in element:\n",
    "        child.attrib['parent'] = parent\n",
    "        addParentInfo(child, child)\n",
    "\n",
    "def getParent(element):\n",
    "    if 'parent' in element.attrib:\n",
    "        return element.attrib['parent']\n",
    "    else:\n",
    "        return None\n",
    "\n",
    "# Dictionary to store transition frequencies\n",
    "transition_frequencies = {}\n",
    "total_transitions = 0    \n",
    "# Dictionary to store transitions grouped by ('from', 'to') value\n",
    "grouped_transitions = {}\n",
    "\n",
    "for bo in paullist:\n",
    "    InputFile = os.path.join(InputDir, f'{bo}.xml')\n",
    "    print (f'Reading file {InputFile}')\n",
    "    \n",
    "    # Load the XML file\n",
    "    tree = ET.parse(InputFile)\n",
    "    root = tree.getroot()\n",
    "    \n",
    "    # Add 'parent' attribute to each child element\n",
    "    addParentInfo(None, root)\n",
    "    \n",
    "    # Iterate over 'Tree' elements\n",
    "    for tree in root.findall('.//Tree'):\n",
    "        # Iterate over child nodes of the current 'Tree' element\n",
    "        for node in tree.findall('.//Node'):\n",
    "            # Check if the node has child nodes\n",
    "            has_children = bool(list(node))\n",
    "\n",
    "            # Determine the current rule\n",
    "            node_cat = node.get('Cat') if has_children else 'Term'\n",
    "\n",
    "            # Get the parent node using the 'getParent' function\n",
    "            parent_node = getParent(node)\n",
    "\n",
    "            # Check if there is a parent node\n",
    "            if parent_node is not None:\n",
    "                parent_cat = parent_node.get('Cat')\n",
    "                if parent_cat == None and node_cat != None:\n",
    "                    parent_cat = \"Start\"\n",
    "                    continue\n",
    "\n",
    "            # Combine parent and current rule to form the transition\n",
    "            transition = (parent_cat, node_cat)\n",
    "\n",
    "            # Update the frequency count in the dictionary\n",
    "            total_transitions += 1\n",
    "            transition_frequencies[transition] = transition_frequencies.get(transition, 0) + 1\n",
    "\n",
    "print (f'number of transitions: {total_transitions}')\n",
    "            \n",
    "# Group transitions based on ('from', 'to') value\n",
    "for (from_value, to_value), frequency in transition_frequencies.items():\n",
    "    grouped_transitions.setdefault(from_value, []).append((from_value, to_value, frequency))\n",
    "\n",
    "# Print separate tables for each group\n",
    "for from_value, transitions in grouped_transitions.items():\n",
    "    print(f\"Transition table for starting condition: {from_value}\")\n",
    "    print(\"From\\tTo\\tTransitions\\tAverage Occurrence\")\n",
    "    \n",
    "    for from_val, to_val, frequency in transitions:\n",
    "        weight = frequency / total_transitions\n",
    "        print(f'{from_val}\\t{to_val}\\t{frequency}\\t{weight:.4}')\n",
    "    \n",
    "    print('\\n')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9c29796-0cbe-4480-9f2d-4e8dfbc0814e",
   "metadata": {},
   "source": [
    "# 4 - Normalizing probabilities per source status<a class=\"anchor\" id=\"bullet4\"></a>\n",
    "##### [Back to TOC](#TOC)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "id": "e5a3d39d-99fe-4050-b1cf-e6f3d5c60fba",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "loading books ...\n",
      "Finished\tNumber of transitions: 7678\n",
      "\n",
      "Transition table for starting condition: S\n",
      "From\tTo\tOcc.\tWeigth\n",
      "S\tnp\t223\t0.5533\n",
      "S\tCL\t180\t0.4467\n",
      "\n",
      "\n",
      "Transition table for starting condition: CL\n",
      "From\tTo\tOcc.\tWeigth\n",
      "CL\tCL\t743\t0.2964\n",
      "CL\tV\t425\t0.1695\n",
      "CL\tTerm\t295\t0.1177\n",
      "CL\tADV\t271\t0.1081\n",
      "CL\tO\t246\t0.09813\n",
      "CL\tS\t223\t0.08895\n",
      "CL\tP\t111\t0.04428\n",
      "CL\tVC\t104\t0.04148\n",
      "CL\tIO\t36\t0.01436\n",
      "CL\tnp\t28\t0.01117\n",
      "CL\tconj\t12\t0.004787\n",
      "CL\tadvp\t9\t0.00359\n",
      "CL\tO2\t4\t0.001596\n",
      "\n",
      "\n",
      "Transition table for starting condition: np\n",
      "From\tTo\tOcc.\tWeigth\n",
      "np\tTerm\t1267\t0.5599\n",
      "np\tnp\t757\t0.3345\n",
      "np\tadjp\t113\t0.04993\n",
      "np\tCL\t95\t0.04198\n",
      "np\tadvp\t16\t0.00707\n",
      "np\tpp\t15\t0.006628\n",
      "\n",
      "\n",
      "Transition table for starting condition: VC\n",
      "From\tTo\tOcc.\tWeigth\n",
      "VC\tvp\t104\t1.0\n",
      "\n",
      "\n",
      "Transition table for starting condition: vp\n",
      "From\tTo\tOcc.\tWeigth\n",
      "vp\tTerm\t540\t0.98\n",
      "vp\tvp\t11\t0.01996\n",
      "\n",
      "\n",
      "Transition table for starting condition: P\n",
      "From\tTo\tOcc.\tWeigth\n",
      "P\tnp\t47\t0.4234\n",
      "P\tpp\t46\t0.4144\n",
      "P\tadjp\t18\t0.1622\n",
      "\n",
      "\n",
      "Transition table for starting condition: pp\n",
      "From\tTo\tOcc.\tWeigth\n",
      "pp\tTerm\t228\t0.479\n",
      "pp\tnp\t221\t0.4643\n",
      "pp\tpp\t21\t0.04412\n",
      "pp\tadvp\t6\t0.01261\n",
      "\n",
      "\n",
      "Transition table for starting condition: O\n",
      "From\tTo\tOcc.\tWeigth\n",
      "O\tnp\t218\t0.8862\n",
      "O\tCL\t28\t0.1138\n",
      "\n",
      "\n",
      "Transition table for starting condition: V\n",
      "From\tTo\tOcc.\tWeigth\n",
      "V\tvp\t425\t1.0\n",
      "\n",
      "\n",
      "Transition table for starting condition: ADV\n",
      "From\tTo\tOcc.\tWeigth\n",
      "ADV\tpp\t152\t0.5507\n",
      "ADV\tadvp\t92\t0.3333\n",
      "ADV\tnp\t20\t0.07246\n",
      "ADV\tCL\t8\t0.02899\n",
      "ADV\tADV\t2\t0.007246\n",
      "ADV\tTerm\t1\t0.003623\n",
      "ADV\tadjp\t1\t0.003623\n",
      "\n",
      "\n",
      "Transition table for starting condition: IO\n",
      "From\tTo\tOcc.\tWeigth\n",
      "IO\tnp\t36\t1.0\n",
      "\n",
      "\n",
      "Transition table for starting condition: adjp\n",
      "From\tTo\tOcc.\tWeigth\n",
      "adjp\tTerm\t135\t0.9783\n",
      "adjp\tadjp\t2\t0.01449\n",
      "adjp\tCL\t1\t0.007246\n",
      "\n",
      "\n",
      "Transition table for starting condition: advp\n",
      "From\tTo\tOcc.\tWeigth\n",
      "advp\tTerm\t122\t0.9683\n",
      "advp\tadjp\t2\t0.01587\n",
      "advp\tadvp\t2\t0.01587\n",
      "\n",
      "\n",
      "Transition table for starting condition: conj\n",
      "From\tTo\tOcc.\tWeigth\n",
      "conj\tTerm\t12\t1.0\n",
      "\n",
      "\n",
      "Transition table for starting condition: O2\n",
      "From\tTo\tOcc.\tWeigth\n",
      "O2\tnp\t4\t1.0\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# avarages for each seperate transition (i.e. all rules sum op to p=1 per starting condition)\n",
    "\n",
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "def addParentInfo(parent, element):\n",
    "    for child in element:\n",
    "        child.attrib['parent'] = parent\n",
    "        addParentInfo(child, child)\n",
    "\n",
    "def getParent(element):\n",
    "    if 'parent' in element.attrib:\n",
    "        return element.attrib['parent']\n",
    "    else:\n",
    "        return None\n",
    "\n",
    "# Dictionary to store transition frequencies\n",
    "transition_frequencies = {}\n",
    "total_transitions = 0\n",
    "\n",
    "# Dictionary to store transitions grouped by ('from', 'to') value\n",
    "grouped_transitions = {}\n",
    "print('loading books ',end='')\n",
    "\n",
    "for bo in johnlist:\n",
    "    InputFile = os.path.join(InputDir, f'{bo}.xml')\n",
    "    #print (f'Reading file {InputFile}')\n",
    "    print ('.',end='')\n",
    "    \n",
    "    # Load the XML file\n",
    "    tree = ET.parse(InputFile)\n",
    "    root = tree.getroot()\n",
    "    \n",
    "    # Add 'parent' attribute to each child element\n",
    "    addParentInfo(None, root)\n",
    "\n",
    "    # Iterate over 'Tree' elements\n",
    "    for tree in root.findall('.//Tree'):\n",
    "        # Iterate over child nodes of the current 'Tree' element\n",
    "        for node in tree.findall('.//Node'):\n",
    "            # Check if the node has child nodes\n",
    "            has_children = bool(list(node))\n",
    "\n",
    "            # Determine the current rule\n",
    "            node_cat = node.get('Cat') if has_children else 'Term'\n",
    "\n",
    "            # Get the parent node using the 'getParent' function\n",
    "            parent_node = getParent(node)\n",
    "\n",
    "            # Check if there is a parent node\n",
    "            if parent_node is not None:\n",
    "                parent_cat = parent_node.get('Cat')\n",
    "                if parent_cat is None and node_cat is not None:\n",
    "                    parent_cat = \"Start\"\n",
    "                    continue\n",
    "\n",
    "                # Combine parent and current rule to form the transition\n",
    "                transition = (parent_cat, node_cat)\n",
    "\n",
    "                # Update the frequency count in the dictionary\n",
    "                total_transitions += 1\n",
    "                transition_frequencies[transition] = transition_frequencies.get(transition, 0) + 1\n",
    "\n",
    "print (f'\\nFinished\\tNumber of transitions: {total_transitions}\\n')\n",
    "\n",
    "# Group transitions based on ('from', 'to') value\n",
    "for (from_value, to_value), frequency in transition_frequencies.items():\n",
    "    grouped_transitions.setdefault(from_value, []).append((from_value, to_value, frequency))\n",
    "\n",
    "# Print separate tables for each group with sorted transitions\n",
    "for from_value, transitions in grouped_transitions.items():\n",
    "    print(f\"Transition table for starting condition: {from_value}\")\n",
    "    print(\"From\\tTo\\tOcc.\\tWeigth\")\n",
    "    \n",
    "    # Sort transitions based on frequency in descending order\n",
    "    sorted_transitions = sorted(transitions, key=lambda x: x[2], reverse=True)\n",
    "\n",
    "    # Calculate total occurrences for the current table\n",
    "    total_occurrences = sum(occurrence for _, _, occurrence in sorted_transitions)\n",
    "\n",
    "    for from_val, to_val, frequency in sorted_transitions:\n",
    "        # Calculate the average occurrence for each transition\n",
    "        average_occurrence = frequency / total_occurrences\n",
    "        print(f'{from_val}\\t{to_val}\\t{frequency}\\t{average_occurrence:.4}')\n",
    "\n",
    "    print('\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "192d6936-d9ed-40fb-a0e8-9f22f8c8fa30",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}