# Weight calculation for a PCFG model (GBI treebank / N1904GBI)

## Table of contents
* 1 - Introduction
* 2 - Create sum of transitions
* 3 - Average probabilities for the complete set
* 4 - Normalizing probabilities per source status

# 1 - Introduction 
##### [Back to TOC](#TOC)

PCFG stands for Probabilistic Context-Free Grammar: a context-free grammar in which each production rule is assigned a probability, indicating how likely that rule is to be used in a derivation.

The maximum-likelihood estimate of the probability of a transition $\alpha → \beta$ is:

$q_{ML}(\alpha → \beta) =\frac{count (\alpha → \beta)}{count (\alpha)}$

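Consequently, the probabilities of all $n$ transitions $\alpha → \beta_i$ that share the same left-hand side $\alpha$ sum to one:

$\sum_{i=1}^{n} q_{ML}(\alpha → \beta_i) = 1$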
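
As a worked instance of the formulas above, take the counts reported for the source category P in the output of section 4 (letters of John): 47 transitions to np, 46 to pp and 18 to adjp, so $count(P) = 47 + 46 + 18 = 111$. The rounded values agree with the weights printed in that table.

$$
\begin{aligned}
q_{ML}(P → np) &= \frac{47}{111} \approx 0.4234 \\
q_{ML}(P → pp) &= \frac{46}{111} \approx 0.4144 \\
q_{ML}(P → adjp) &= \frac{18}{111} \approx 0.1622 \\
\sum_{i=1}^{3} q_{ML}(P → \beta_i) &= \frac{47 + 46 + 18}{111} = 1
\end{aligned}
$$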



Testing dataset: N1904 treebank (GBI)

# 2 - Create sum of transitions 
##### [Back to TOC](#TOC)

In [2]:
import pandas as pd
import sys
import os
import time
import pickle

import re # used for regular expressions
from os import listdir
from os.path import isfile, join
import xml.etree.ElementTree as ET

In [3]:
BaseDir = 'C:\\Users\\tonyj\\my_new_Jupyter_folder\\test_of_xml_etree\\'
InputDir = BaseDir+'inputfiles\\'
bo='26-jude'
InputFile = os.path.join(InputDir, f'{bo}.xml')
tree = ET.parse(InputFile)
root = tree.getroot()

# Dictionary to store transition frequencies
transition_frequencies = {}
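
To make the notion of a parent-to-child transition concrete, the sketch below extracts transitions from a tiny inline XML string. The element names `Sentences`, `Sentence` and `Trees` and the sample content are assumptions made for illustration only; just `Tree`, `Node` and the `Cat` attribute are taken from the code in this notebook. Instead of storing the parent inside each element's attributes, it builds a separate parent map, which is one common `ElementTree` idiom and not necessarily the approach used in the cells below.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a treebank file; the real layout is assumed here.
SAMPLE = """
<Sentences>
  <Sentence>
    <Trees>
      <Tree>
        <Node Cat="S">
          <Node Cat="CL">
            <Node Cat="V"><Node Cat="verb"/></Node>
            <Node Cat="O"><Node Cat="np"><Node Cat="noun"/></Node></Node>
          </Node>
        </Node>
      </Tree>
    </Trees>
  </Sentence>
</Sentences>
"""

root = ET.fromstring(SAMPLE.strip())

# Map every element to its parent (ElementTree does not store parent links).
parent_of = {child: parent for parent in root.iter() for child in parent}

transitions = []
for node in root.iter("Node"):
    parent = parent_of.get(node)
    # Skip top-level nodes whose parent carries no 'Cat' (e.g. <Tree>).
    if parent is None or parent.get("Cat") is None:
        continue
    # A node without children is treated as a terminal, labelled 'Term'.
    child_cat = node.get("Cat") if len(node) else "Term"
    transitions.append((parent.get("Cat"), child_cat))

for src, dst in transitions:
    print(f"{src} -> {dst}")
```

A separate parent map keeps the parsed tree unmodified, which matters if the tree is later written back to disk.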

Several sets of books are defined here, so that the probability values calculated for different parts of the corpus can be compared.

In [4]:
booklist = ['01-matthew', '02-mark', '03-luke', '04-john', '05-acts', '06-romans',
 '07-1corinthians','08-2corinthians', '09-galatians', '10-ephesians',
 '11-philippians', '12-colossians', '13-1thessalonians', '14-2thessalonians',
 '15-1timothy', '16-2timothy', '17-titus', '18-philemon', '19-hebrews', 
 '20-james', '21-1peter', '22-2peter', '23-1john', '24-2john', '25-3john',
 '26-jude', '27-revelation']
paullist= ['06-romans', '07-1corinthians','08-2corinthians', '09-galatians', '10-ephesians',
 '11-philippians', '12-colossians', '13-1thessalonians', '14-2thessalonians',
 '15-1timothy', '16-2timothy', '17-titus', '18-philemon']
peterlist= ['21-1peter', '22-2peter']
lukelist= ['03-luke','05-acts']
johnlist = ['23-1john', '24-2john', '25-3john']
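
One possible way to switch between these sets when comparing results is to keep them in a dictionary keyed by a short label. The sketch below is only an illustration: the dictionary `corpora`, its labels and the helper `input_files` are hypothetical names that do not appear in the notebook, and only three of the lists are repeated to keep it short.

```python
import os

# Hypothetical grouping of some of the book lists under short labels.
corpora = {
    "peter": ["21-1peter", "22-2peter"],
    "luke": ["03-luke", "05-acts"],
    "john": ["23-1john", "24-2john", "25-3john"],
}

def input_files(label, input_dir):
    """Return the XML paths for one corpus, following the '<book>.xml' naming."""
    return [os.path.join(input_dir, f"{book}.xml") for book in corpora[label]]

# The directory name 'inputfiles' is a placeholder for the real input folder.
for label in corpora:
    print(label, input_files(label, "inputfiles"))
```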

# 3 - Average probabilities for the complete set 
##### [Back to TOC](#TOC)

Here each transition count is divided by the total number of transitions, so the weights of all rules together sum to p = 1.
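
The global normalization used in this section can be illustrated on toy data. The sketch below uses made-up counts (not taken from the treebank) and divides each transition count by the grand total, so that the weights over the complete set sum to 1.

```python
from collections import Counter

# Hypothetical toy counts of (parent, child) transitions; not treebank data.
transition_counts = Counter({
    ("CL", "V"): 3,
    ("CL", "O"): 1,
    ("np", "Term"): 4,
    ("np", "np"): 2,
})

# Grand total over every transition, regardless of the parent category.
total = sum(transition_counts.values())

# Weight of each transition relative to the complete set.
global_weights = {t: n / total for t, n in transition_counts.items()}

for (src, dst), weight in global_weights.items():
    print(f"{src}\t{dst}\t{weight:.4f}")
```

Under this scheme the weights belonging to a single parent category generally do not sum to 1; the per-source normalization of section 4 addresses that.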

In [5]:
import xml.etree.ElementTree as ET

def addParentInfo(parent, element):
    # Store a reference to the parent element on every descendant
    for child in element:
        child.attrib['parent'] = parent
        addParentInfo(child, child)

def getParent(element):
    if 'parent' in element.attrib:
        return element.attrib['parent']
    else:
        return None

# Dictionary to store transition frequencies
transition_frequencies = {}
total_transitions = 0
# Dictionary to store transitions grouped by 'from' value
grouped_transitions = {}

for bo in paullist:
    InputFile = os.path.join(InputDir, f'{bo}.xml')
    print (f'Reading file {InputFile}')

    # Load the XML file
    tree = ET.parse(InputFile)
    root = tree.getroot()

    # Add 'parent' attribute to each child element
    addParentInfo(None, root)

    # Iterate over 'Tree' elements
    for tree in root.findall('.//Tree'):
        # Iterate over all 'Node' descendants of the current 'Tree' element
        for node in tree.findall('.//Node'):
            # Check if the node has child nodes
            has_children = bool(list(node))

            # Determine the current rule ('Term' for nodes without children)
            node_cat = node.get('Cat') if has_children else 'Term'

            # Get the parent node using the 'getParent' function
            parent_node = getParent(node)

            # Check if there is a parent node
            if parent_node is not None:
                parent_cat = parent_node.get('Cat')
                if parent_cat is None and node_cat is not None:
                    # Top-level node: because of the 'continue', no "Start" transition is counted
                    parent_cat = "Start"
                    continue

                # Combine parent and current rule to form the transition
                transition = (parent_cat, node_cat)

                # Update the frequency count in the dictionary
                total_transitions += 1
                transition_frequencies[transition] = transition_frequencies.get(transition, 0) + 1

print (f'number of transitions: {total_transitions}')

# Group transitions based on their 'from' value
for (from_value, to_value), frequency in transition_frequencies.items():
    grouped_transitions.setdefault(from_value, []).append((from_value, to_value, frequency))

# Print separate tables for each group
for from_value, transitions in grouped_transitions.items():
    print(f"Transition table for starting condition: {from_value}")
    print("From\tTo\tTransitions\tAverage Occurrence")

    for from_val, to_val, frequency in transitions:
        # Weight relative to ALL transitions in the selected books
        weight = frequency / total_transitions
        print(f'{from_val}\t{to_val}\t{frequency}\t{weight:.4}')

    print('\n')


Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\06-romans.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\07-1corinthians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\08-2corinthians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\09-galatians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\10-ephesians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\11-philippians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\12-colossians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\13-1thessalonians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\14-2thessalonians.xml
Reading file C:\Users\tonyj\my_new_Jupyter_folder\test_of_xml_etree\inputfiles\15-1timothy.xml
Reading file C:\Users\ton

# 4 - Normalizing probabilities per source status
##### [Back to TOC](#TOC)
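
A compact sketch of the per-source normalization performed in this section, using a few counts copied from the output tables further down (letters of John). The helper name `normalize_per_source` is illustrative and does not appear in the notebook; the cell below does the same calculation inline while printing.

```python
from collections import defaultdict

# Counts copied from the VC, vp and P tables in the output of this section.
transition_counts = {
    ("VC", "vp"): 104,
    ("vp", "Term"): 540,
    ("vp", "vp"): 11,
    ("P", "np"): 47,
    ("P", "pp"): 46,
    ("P", "adjp"): 18,
}

def normalize_per_source(counts):
    """Divide each count by the total count of its source (parent) category."""
    totals = defaultdict(int)
    for (src, _dst), n in counts.items():
        totals[src] += n
    return {(src, dst): n / totals[src] for (src, dst), n in counts.items()}

weights = normalize_per_source(transition_counts)

for (src, dst), weight in weights.items():
    print(f"{src}\t{dst}\t{weight:.4}")
```

With this normalization the weights of each source category sum to 1 on their own, which corresponds to the $q_{ML}$ definition in the introduction.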

In [98]:
# Averages for each separate transition (i.e. all rules sum up to p=1 per starting condition)

import xml.etree.ElementTree as ET

def addParentInfo(parent, element):
    # Store a reference to the parent element on every descendant
    for child in element:
        child.attrib['parent'] = parent
        addParentInfo(child, child)

def getParent(element):
    if 'parent' in element.attrib:
        return element.attrib['parent']
    else:
        return None

# Dictionary to store transition frequencies
transition_frequencies = {}
total_transitions = 0

# Dictionary to store transitions grouped by 'from' value
grouped_transitions = {}
print('loading books ',end='')

for bo in johnlist:
    InputFile = os.path.join(InputDir, f'{bo}.xml')
    #print (f'Reading file {InputFile}')
    print ('.',end='')

    # Load the XML file
    tree = ET.parse(InputFile)
    root = tree.getroot()

    # Add 'parent' attribute to each child element
    addParentInfo(None, root)

    # Iterate over 'Tree' elements
    for tree in root.findall('.//Tree'):
        # Iterate over all 'Node' descendants of the current 'Tree' element
        for node in tree.findall('.//Node'):
            # Check if the node has child nodes
            has_children = bool(list(node))

            # Determine the current rule ('Term' for nodes without children)
            node_cat = node.get('Cat') if has_children else 'Term'

            # Get the parent node using the 'getParent' function
            parent_node = getParent(node)

            # Check if there is a parent node
            if parent_node is not None:
                parent_cat = parent_node.get('Cat')
                if parent_cat is None and node_cat is not None:
                    # Top-level node: because of the 'continue', no "Start" transition is counted
                    parent_cat = "Start"
                    continue

                # Combine parent and current rule to form the transition
                transition = (parent_cat, node_cat)

                # Update the frequency count in the dictionary
                total_transitions += 1
                transition_frequencies[transition] = transition_frequencies.get(transition, 0) + 1

print (f'\nFinished\tNumber of transitions: {total_transitions}\n')

# Group transitions based on their 'from' value
for (from_value, to_value), frequency in transition_frequencies.items():
    grouped_transitions.setdefault(from_value, []).append((from_value, to_value, frequency))

# Print separate tables for each group with sorted transitions
for from_value, transitions in grouped_transitions.items():
    print(f"Transition table for starting condition: {from_value}")
    print("From\tTo\tOcc.\tWeigth")

    # Sort transitions based on frequency in descending order
    sorted_transitions = sorted(transitions, key=lambda x: x[2], reverse=True)

    # Calculate total occurrences for the current table
    total_occurrences = sum(occurrence for _, _, occurrence in sorted_transitions)

    for from_val, to_val, frequency in sorted_transitions:
        # Weight relative to all transitions with the same starting condition
        average_occurrence = frequency / total_occurrences
        print(f'{from_val}\t{to_val}\t{frequency}\t{average_occurrence:.4}')

    print('\n')

loading books ...
Finished	Number of transitions: 7678

Transition table for starting condition: S
From	To	Occ.	Weigth
S	np	223	0.5533
S	CL	180	0.4467


Transition table for starting condition: CL
From	To	Occ.	Weigth
CL	CL	743	0.2964
CL	V	425	0.1695
CL	Term	295	0.1177
CL	ADV	271	0.1081
CL	O	246	0.09813
CL	S	223	0.08895
CL	P	111	0.04428
CL	VC	104	0.04148
CL	IO	36	0.01436
CL	np	28	0.01117
CL	conj	12	0.004787
CL	advp	9	0.00359
CL	O2	4	0.001596


Transition table for starting condition: np
From	To	Occ.	Weigth
np	Term	1267	0.5599
np	np	757	0.3345
np	adjp	113	0.04993
np	CL	95	0.04198
np	advp	16	0.00707
np	pp	15	0.006628


Transition table for starting condition: VC
From	To	Occ.	Weigth
VC	vp	104	1.0


Transition table for starting condition: vp
From	To	Occ.	Weigth
vp	Term	540	0.98
vp	vp	11	0.01996


Transition table for starting condition: P
From	To	Occ.	Weigth
P	np	47	0.4234
P	pp	46	0.4144
P	adjp	18	0.1622


Transition table for starting condition: pp
From	To	Occ.	Weigth
pp	Term	228	0.479
pp