{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Creating Text-Fabric dataset (from LowFat XML trees)\n", "\n", "
\n", " Code version: 0.7 (February 20, 2024)\n", " Data version: February 10, 2024 (Readme)\n", "\n", "\n", "## Table of content \n", "* 1 - Introduction\n", "* 2 - Read LowFat XML data and store in pickle\n", " * 2.1 - Required libraries\n", " * 2.2 - Import various libraries\n", " * 2.3 - Initialize global data\n", " * 2.4 - Process the XML data and store dataframe in pickle\n", "* 3 - Optionaly export to aid investigation\n", " * 3.1 - Export to Excel format \n", " * 3.2 - Export to CSV format\n", "* 4 - Text-Fabric dataset production from pickle input\n", " * 4.1 - Explanation\n", " * 4.2 - Running the TF walker function\n", "* 5 - Housekeeping\n", " * 5.1 - Optionaly zip-up the pickle files\n", " * 5.2 - Publishing on gitHub" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# 1 - Introduction \n", "##### [Back to TOC](#TOC)\n", "\n", "The source data for the conversion are the LowFat XML trees files representing the macula-greek version of the Nestle 1904 Greek New Testment (British Foreign Bible Society, 1904). The starting dataset is formatted according to Syntax diagram markup by the Global Bible Initiative (GBI). The most recent source data can be found on github https://github.com/Clear-Bible/macula-greek/tree/main/Nestle1904/lowfat. \n", "\n", "Attribution: \"MACULA Greek Linguistic Datasets, available at https://github.com/Clear-Bible/macula-greek/\". \n", "\n", "The production of the Text-Fabric files consist of two phases. First one is the creation of piclke files (section 2). The second phase is the the actual Text-Fabric creation process (section 3). The process can be depicted as follows:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# 2 - Read LowFat XML data and store in pickle \n", "##### [Back to TOC](#TOC)\n", "\n", "This script harvests all information from the LowFat tree data (XML nodes), puts it into a Panda DataFrame and stores the result per book in a pickle file. 
Note: pickling (in Python) is serialising an object into a disk file (or buffer). See also the [Python3 documentation](https://docs.python.org/3/library/pickle.html).\n", "\n", "Within the context of this script, the term 'Leaf' refers to nodes that contain the Greek word as data. These nodes are also referred to as 'terminal nodes' since they do not have any children, similar to leaves on a tree. Additionally, Parent1 represents the parent of the leaf, Parent2 represents the parent of Parent1, and so on. For a visual representation, please refer to the following diagram.\n", "\n", "\n", "\n", "For a full description of the source data, see the document [MACULA Greek Treebank for the Nestle 1904 Greek New Testament.pdf](https://github.com/Clear-Bible/macula-greek/blob/main/doc/MACULA%20Greek%20Treebank%20for%20the%20Nestle%201904%20Greek%20New%20Testament.pdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.1 - Required libraries\n", "##### [Back to TOC](#TOC)\n", "\n", "The scripts in this notebook require (besides text-fabric) a number of Python libraries to be installed in the environment (see the following section).\n", "You can install any missing library from within Jupyter Notebook using either `pip` or `pip3` (e.g., `!pip3 install pandas`)." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## 2.2 - Import various libraries\n", "##### [Back to TOC](#TOC)\n", "\n", "The following cell imports all libraries required by the scripts in this notebook."
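As a minimal illustration of the pickling mentioned above (a sketch only: the filename `example.pkl` and the toy dictionary are hypothetical, while the notebook itself pickles whole Pandas DataFrames):

```python
import pickle

# Minimal sketch of "pickling": serialise a Python object to a disk file
# and read it back. 'example.pkl' is a hypothetical illustration, not one
# of the notebook's data files.
data = {'word': 'λόγος', 'bookNum': 4, 'chapter': 1, 'verse': 1}

with open('example.pkl', 'wb') as f:
    pickle.dump(data, f)        # serialise the object to disk

with open('example.pkl', 'rb') as f:
    restored = pickle.load(f)   # deserialise it back into memory

assert restored == data
```

Pandas offers the same mechanism for DataFrames via `DataFrame.to_pickle()` and `pandas.read_pickle()`.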
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2022-10-28T02:58:14.739227Z", "start_time": "2022-10-28T02:57:38.766097Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import sys # System\n", "import os # Operating System\n", "from os import listdir\n", "from os.path import isfile, join\n", "import time\n", "import pickle\n", "import re # Regular Expressions\n", "from lxml import etree as ET\n", "from tf.fabric import Fabric\n", "from tf.convert.walker import CV\n", "from tf.parameters import VERSION\n", "from datetime import date\n", "import pickle\n", "import unicodedata\n", "from unidecode import unidecode\n", "import openpyxl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3 - Initialize global data\n", "##### [Back to TOC](#TOC)\n", "\n", "The following cell initializes the global data used by the various scripts in this notebook. Many of these global variables are shared among the scripts as they relate to common entities.\n", "\n", "IMPORTANT: To ensure proper creation of the Text-Fabric files on your system, it is crucial to adjust the values of BaseDir, XmlDir, etc. to match the location of the data and the operating system you are using. In this Jupyter Notebook, Windows is the operating system employed." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# set script version number\n", "scriptVersion='0.7'\n", "scriptDate='February 20, 2024'\n", "\n", "# Define the source and destination locations \n", "BaseDir = '..\\\\'\n", "XmlDir = BaseDir+'xml\\\\20240210\\\\'\n", "PklDir = BaseDir+'pickle\\\\20240210\\\\'\n", "XlsxDir = BaseDir+'excel\\\\20240210\\\\'\n", "CsvDir = BaseDir+'csv\\\\20240210\\\\'\n", "# note: create output directory prior running the scripts!\n", "\n", "# key: filename, [0]=bookLong, [1]=bookNum, [3]=bookShort\n", "bo2book = {'01-matthew': ['Matthew', '1', 'Matt'],\n", " '02-mark': ['Mark', '2', 'Mark'],\n", " '03-luke': ['Luke', '3', 'Luke'],\n", " '04-john': ['John', '4', 'John'],\n", " '05-acts': ['Acts', '5', 'Acts'],\n", " '06-romans': ['Romans', '6', 'Rom'],\n", " '07-1corinthians': ['I_Corinthians', '7', '1Cor'],\n", " '08-2corinthians': ['II_Corinthians', '8', '2Cor'],\n", " '09-galatians': ['Galatians', '9', 'Gal'],\n", " '10-ephesians': ['Ephesians', '10', 'Eph'],\n", " '11-philippians': ['Philippians', '11', 'Phil'],\n", " '12-colossians': ['Colossians', '12', 'Col'],\n", " '13-1thessalonians':['I_Thessalonians', '13', '1Thess'],\n", " '14-2thessalonians':['II_Thessalonians','14', '2Thess'],\n", " '15-1timothy': ['I_Timothy', '15', '1Tim'],\n", " '16-2timothy': ['II_Timothy', '16', '2Tim'],\n", " '17-titus': ['Titus', '17', 'Titus'],\n", " '18-philemon': ['Philemon', '18', 'Phlm'],\n", " '19-hebrews': ['Hebrews', '19', 'Heb'],\n", " '20-james': ['James', '20', 'Jas'],\n", " '21-1peter': ['I_Peter', '21', '1Pet'],\n", " '22-2peter': ['II_Peter', '22', '2Pet'],\n", " '23-1john': ['I_John', '23', '1John'],\n", " '24-2john': ['II_John', '24', '2John'],\n", " '25-3john': ['III_John', '25', '3John'], \n", " '26-jude': ['Jude', '26', 'Jude'],\n", " '27-revelation': ['Revelation', '27', 'Rev']}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.4 - Process the XML data and store 
dataframe in pickle\n", "##### [Back to TOC](#TOC)\n", "\n", "This code processes all 27 books in the correct order.\n", "For each book, the following is done:\n", "\n", "* create a parent-child map based upon the XML source (function buildParentMap).\n", "* loop trough the XML source to identify 'leaf' nodes and gather information regarding all its parents (function processElement) and store the results in a datalist.\n", "* After processing all the nodes the datalist is converted to a datframe and exported as a pickle file specific to that book.\n", "\n", "Once the XML data is converted to PKL files, there is no need to rerun (unless the source XML data is updated).\n", "\n", "Since the size of the pickle files can be rather large, it is advised to add the .pkl extention to the ignore list of gitHub (.gitignore)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Extract data from XML files and store it in pickle files\n", "\tProcessing Matthew at ..\\xml\\20240210\\01-matthew.xml Found 18299 items in 1.42 seconds.\n", "\tProcessing Mark at ..\\xml\\20240210\\02-mark.xml Found 11277 items in 0.95 seconds.\n", "\tProcessing Luke at ..\\xml\\20240210\\03-luke.xml Found 19456 items in 4.10 seconds.\n", "\tProcessing John at ..\\xml\\20240210\\04-john.xml Found 15643 items in 1.24 seconds.\n", "\tProcessing Acts at ..\\xml\\20240210\\05-acts.xml Found 18393 items in 1.59 seconds.\n", "\tProcessing Romans at ..\\xml\\20240210\\06-romans.xml Found 7100 items in 0.66 seconds.\n", "\tProcessing I_Corinthians at ..\\xml\\20240210\\07-1corinthians.xml Found 6820 items in 0.58 seconds.\n", "\tProcessing II_Corinthians at ..\\xml\\20240210\\08-2corinthians.xml Found 4469 items in 0.43 seconds.\n", "\tProcessing Galatians at ..\\xml\\20240210\\09-galatians.xml Found 2228 items in 0.29 seconds.\n", "\tProcessing Ephesians at ..\\xml\\20240210\\10-ephesians.xml Found 2419 items in 
0.30 seconds.\n", "\tProcessing Philippians at ..\\xml\\20240210\\11-philippians.xml Found 1630 items in 0.21 seconds.\n", "\tProcessing Colossians at ..\\xml\\20240210\\12-colossians.xml Found 1575 items in 0.23 seconds.\n", "\tProcessing I_Thessalonians at ..\\xml\\20240210\\13-1thessalonians.xml Found 1473 items in 0.14 seconds.\n", "\tProcessing II_Thessalonians at ..\\xml\\20240210\\14-2thessalonians.xml Found 822 items in 0.11 seconds.\n", "\tProcessing I_Timothy at ..\\xml\\20240210\\15-1timothy.xml Found 1588 items in 0.18 seconds.\n", "\tProcessing II_Timothy at ..\\xml\\20240210\\16-2timothy.xml Found 1237 items in 0.28 seconds.\n", "\tProcessing Titus at ..\\xml\\20240210\\17-titus.xml Found 658 items in 0.14 seconds.\n", "\tProcessing Philemon at ..\\xml\\20240210\\18-philemon.xml Found 335 items in 0.20 seconds.\n", "\tProcessing Hebrews at ..\\xml\\20240210\\19-hebrews.xml Found 4955 items in 0.42 seconds.\n", "\tProcessing James at ..\\xml\\20240210\\20-james.xml Found 1739 items in 0.35 seconds.\n", "\tProcessing I_Peter at ..\\xml\\20240210\\21-1peter.xml Found 1676 items in 0.38 seconds.\n", "\tProcessing II_Peter at ..\\xml\\20240210\\22-2peter.xml Found 1098 items in 0.21 seconds.\n", "\tProcessing I_John at ..\\xml\\20240210\\23-1john.xml Found 2136 items in 0.33 seconds.\n", "\tProcessing II_John at ..\\xml\\20240210\\24-2john.xml Found 245 items in 0.14 seconds.\n", "\tProcessing III_John at ..\\xml\\20240210\\25-3john.xml Found 219 items in 0.13 seconds.\n", "\tProcessing Jude at ..\\xml\\20240210\\26-jude.xml Found 457 items in 0.13 seconds.\n", "\tProcessing Revelation at ..\\xml\\20240210\\27-revelation.xml Found 9832 items in 0.87 seconds.\n", "Finished in 18.58 seconds.\n" ] } ], "source": [ "# Create the pickle files\n", "\n", "# Set global variables for this script\n", "WordOrder = 1\n", "CollectedItems = 0\n", "\n", "###############################################\n", "# The helper functions #\n", 
"###############################################\n", "\n", "def buildParentMap(tree):\n", " \"\"\"\n", " Builds a mapping of child elements to their parent elements in an XML tree.\n", " This function is useful for cases where you need to navigate from a child element\n", " up to its parent element, as the ElementTree API does not provide this functionality directly.\n", "\n", " Parameters:\n", " tree (ElementTree): An XML ElementTree object.\n", "\n", " Returns:\n", " dict: A dictionary where keys are child elements and values are their respective parent elements.\n", " \n", " Usage:\n", " To build the map:\n", " tree = ET.parse(InputFile)\n", " parentMap = buildParentMap(tree)\n", " Then, whenever you need a parent of an element:\n", " parent = getParent(someElement, parentMap)\n", " \n", " \"\"\"\n", " return {c: p for p in tree.iter() for c in p}\n", "\n", "def getParent(et, parentMap):\n", " \"\"\"\n", " Retrieves the parent element of a given element from the parent map.\n", "\n", " Parameters:\n", " et (Element): The XML element whose parent is to be found.\n", " parentMap (dict): A dictionary mapping child elements to their parents.\n", "\n", " Returns:\n", " Element: The parent element of the given element. Returns None if the parent is not found.\n", " \"\"\"\n", " return parentMap.get(et)\n", "\n", "def processElement(elem, bookInfo, WordOrder, parentMap):\n", " \"\"\"\n", " Processes an XML element to extract and augment its attributes with additional data.\n", " This function adds new attributes to an element and modifies existing ones based on the provided\n", " book information, word order, and parent map. 
It also collects hierarchical information\n", " about the element's ancestors in the XML structure.\n", "\n", " Parameters:\n", " elem (Element): The XML element to be processed.\n", " bookInfo (tuple): A tuple containing information about the book (long name, book number, short name).\n", " WordOrder (int): The order of the word in the current processing context.\n", " parentMap (dict): A dictionary mapping child elements to their parents.\n", "\n", " Returns:\n", " tuple: A tuple containing the updated attributes of the element and the next word order.\n", " \"\"\"\n", " global CollectedItems\n", " LeafRef = re.sub(r'[!: ]', \" \", elem.attrib.get('ref')).split()\n", " elemAttrib = dict(elem.attrib) # Create a copy of the attributes using dict()\n", "\n", " # Adding new or modifying existing attributes\n", " elemAttrib.update({\n", " 'wordOrder': WordOrder,\n", " 'LeafName': elem.tag,\n", " 'word': elem.text,\n", " 'bookLong': bookInfo[0],\n", " 'bookNum': int(bookInfo[1]),\n", " 'bookShort': bookInfo[2],\n", " 'chapter': int(LeafRef[1]),\n", " 'verse': int(LeafRef[2]),\n", " 'parents': 0 # Initialize 'parents' attribute\n", " })\n", "\n", " parentnode = getParent(elem, parentMap)\n", " index = 0\n", " while parentnode is not None:\n", " index += 1\n", " parent_attribs = {\n", " f'Parent{index}Name': parentnode.tag,\n", " f'Parent{index}Type': parentnode.attrib.get('type'),\n", " f'Parent{index}Appos': parentnode.attrib.get('appositioncontainer'),\n", " f'Parent{index}Class': parentnode.attrib.get('class'),\n", " f'Parent{index}Rule': parentnode.attrib.get('rule'),\n", " f'Parent{index}Role': parentnode.attrib.get('role'),\n", " f'Parent{index}Cltype': parentnode.attrib.get('cltype'),\n", " f'Parent{index}Unit': parentnode.attrib.get('unit'),\n", " f'Parent{index}Junction': parentnode.attrib.get('junction'),\n", " f'Parent{index}SN': parentnode.attrib.get('SN'),\n", " f'Parent{index}WGN': parentnode.attrib.get('WGN')\n", " }\n", " 
elemAttrib.update(parent_attribs)\n", " parentnode = getParent(parentnode, parentMap)\n", "\n", " elemAttrib['parents'] = index\n", "\n", " CollectedItems += 1\n", " return elemAttrib, WordOrder + 1\n", "\n", "def fixAttributeId(tree):\n", " \"\"\"\n", " Renames attributes in an XML tree that match the pattern '{*}id' to 'id'.\n", "\n", " Parameters:\n", " tree (lxml.etree._ElementTree): The XML tree to be processed.\n", "\n", " Returns:\n", " None: The function modifies the tree in-place and does not return anything.\n", " \"\"\"\n", " # Regex pattern to match attributes like '{...}id'\n", " pattern = re.compile(r'\\{.*\\}id')\n", " for element in tree.iter():\n", " attributes_to_rename = [attr for attr in element.attrib if pattern.match(attr)]\n", " for attr in attributes_to_rename:\n", " element.attrib['id'] = element.attrib.pop(attr)\n", "\n", "###############################################\n", "# The main routine #\n", "###############################################\n", "\n", "# Process books\n", "print ('Extract data from XML files and store it in pickle files')\n", "overalTime = time.time()\n", "for bo, bookInfo in bo2book.items():\n", " CollectedItems = 0\n", " SentenceNumber = 0\n", " WordGroupNumber = 0\n", " dataList = [] # List to store data dictionaries\n", "\n", " InputFile = os.path.join(XmlDir, f'{bo}.xml')\n", " OutputFile = os.path.join(PklDir, f'{bo}.pkl')\n", " print(f'\\tProcessing {bookInfo[0]} at {InputFile} ', end='')\n", "\n", " try:\n", " tree = ET.parse(InputFile)\n", " fixAttributeId(tree)\n", " parentMap = buildParentMap(tree)\n", " except Exception as e:\n", " print(f\"Error parsing XML file {InputFile}: {e}\")\n", " continue\n", "\n", " start_time = time.time()\n", "\n", " for elem in tree.iter():\n", " if elem.tag == 'sentence':\n", " SentenceNumber += 1\n", " elem.set('SN', str(SentenceNumber))\n", " elif elem.tag == 'error': # workaround for one
Name | \n", "# of nodes | \n", "# slots / node | \n", "% coverage | \n", "
---|---|---|---|
book | \n", "27 | \n", "5102.93 | \n", "100 | \n", "
chapter | \n", "260 | \n", "529.92 | \n", "100 | \n", "
verse | \n", "7943 | \n", "17.35 | \n", "100 | \n", "
sentence | \n", "8011 | \n", "17.20 | \n", "100 | \n", "
wg | \n", "105430 | \n", "6.85 | \n", "524 | \n", "
word | \n", "137779 | \n", "1.00 | \n", "100 | \n", "
[Rendered Text-Fabric load output (interactive display residue): app configuration for the corpus "Nestle 1904 (Low Fat Tree)", repository tonyjurg/Nestle1904LFT, version 0.7, DOI 10.5281/zenodo.10182594]