{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Text-Fabric dataset (from LowFat XML trees)\n", "\n", "
\n", " Code version: 0.6 (November 17, 2023)\n", " Data version: August 16, 2023\n", "\n", "\n", "## Table of content \n", "* 1 - Introduction\n", "* 2 - Read LowFat XML data and store in pickle\n", " * 2.1 - Required libraries\n", " * 2.2 - Import various libraries\n", " * 2.3 - Initialize global data\n", " * 2.4 - Add parent info to each node of the XML tree\n", " * 2.5 - Process the XML data and store dataframe in pickle\n", "* 3 - Production Text-Fabric from pickle input\n", " * 3.1 - Load libraries and initialize some data\n", " * 3.2 - Optionaly export to Excel for investigation\n", " * 3.3 - Running the TF walker function" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# 1 - Introduction \n", "##### [Back to TOC](#TOC)\n", "\n", "The source data for the conversion are the LowFat XML trees files representing the macula-greek version of the Nestle 1904 Greek New Testment (British Foreign Bible Society, 1904). The starting dataset is formatted according to Syntax diagram markup by the Global Bible Initiative (GBI). The most recent source data can be found on github https://github.com/Clear-Bible/macula-greek/tree/main/Nestle1904/lowfat. \n", "\n", "Attribution: \"MACULA Greek Linguistic Datasets, available at https://github.com/Clear-Bible/macula-greek/\". \n", "\n", "The production of the Text-Fabric files consist of two phases. First one is the creation of piclke files (section 2). The second phase is the the actual Text-Fabric creation process (section 3). The process can be depicted as follows:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# 2 - Read LowFat XML data and store in pickle \n", "##### [Back to TOC](#TOC)\n", "\n", "This script harvests all information from the LowFat tree data (XML nodes), puts it into a Panda DataFrame and stores the result per book in a pickle file. Note: pickling (in Python) is serialising an object into a disk file (or buffer). 
See also the [Python3 documentation](https://docs.python.org/3/library/pickle.html).\n", "\n", "Within the context of this script, the term 'leaf' refers to nodes that contain the Greek word as data. These nodes are also referred to as 'terminal nodes' since they do not have any children, similar to leaves on a tree. Additionally, Parent1 represents the parent of the leaf, Parent2 represents the parent of Parent1, and so on. For a visual representation, please refer to the following diagram.\n", "\n", "\n", "\n", "For a full description of the source data, see the document [MACULA Greek Treebank for the Nestle 1904 Greek New Testament.pdf](https://github.com/Clear-Bible/macula-greek/blob/main/doc/MACULA%20Greek%20Treebank%20for%20the%20Nestle%201904%20Greek%20New%20Testament.pdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.1 - Required libraries\n", "##### [Back to TOC](#TOC)\n", "\n", "The scripts in this notebook require (besides Text-Fabric) the following Python libraries to be installed in the environment:\n", "\n", "
\n", " pandas\n", " openpyxl\n", "\n", "\n", "You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`. (eg.: !pip3 install pandas)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## 2.2 - Import various libraries\n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2022-10-28T02:58:14.739227Z", "start_time": "2022-10-28T02:57:38.766097Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import sys\n", "import os\n", "import time\n", "import pickle\n", "\n", "import re #regular expressions\n", "from os import listdir\n", "from os.path import isfile, join\n", "import xml.etree.ElementTree as ET\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3 - Initialize global data\n", "##### [Back to TOC](#TOC)\n", "\n", "The following global data initializes the script, gathering the XML data to store it into the pickle files.\n", "\n", "IMPORTANT: To ensure proper creation of the Text-Fabric files on your system, it is crucial to adjust the values of BaseDir, InputDir, and OutputDir to match the location of the data and the operating system you are using. In this Jupyter Notebook, Windows is the operating system employed." 
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "BaseDir = 'D:\\\\TF\\\\'\n", "XmlDir = BaseDir+'xml\\\\'\n", "PklDir = BaseDir+'pkl\\\\'\n", "XlsxDir = BaseDir+'xlsx\\\\'\n", "# note: create output directory prior running this part\n", "\n", "# key: filename, [0]=book_long, [1]=book_num, [3]=book_short\n", "bo2book = {'01-matthew': ['Matthew', '1', 'Matt'],\n", " '02-mark': ['Mark', '2', 'Mark'],\n", " '03-luke': ['Luke', '3', 'Luke'],\n", " '04-john': ['John', '4', 'John'],\n", " '05-acts': ['Acts', '5', 'Acts'],\n", " '06-romans': ['Romans', '6', 'Rom'],\n", " '07-1corinthians': ['I_Corinthians', '7', '1Cor'],\n", " '08-2corinthians': ['II_Corinthians', '8', '2Cor'],\n", " '09-galatians': ['Galatians', '9', 'Gal'],\n", " '10-ephesians': ['Ephesians', '10', 'Eph'],\n", " '11-philippians': ['Philippians', '11', 'Phil'],\n", " '12-colossians': ['Colossians', '12', 'Col'],\n", " '13-1thessalonians':['I_Thessalonians', '13', '1Thess'],\n", " '14-2thessalonians':['II_Thessalonians','14', '2Thess'],\n", " '15-1timothy': ['I_Timothy', '15', '1Tim'],\n", " '16-2timothy': ['II_Timothy', '16', '2Tim'],\n", " '17-titus': ['Titus', '17', 'Titus'],\n", " '18-philemon': ['Philemon', '18', 'Phlm'],\n", " '19-hebrews': ['Hebrews', '19', 'Heb'],\n", " '20-james': ['James', '20', 'Jas'],\n", " '21-1peter': ['I_Peter', '21', '1Pet'],\n", " '22-2peter': ['II_Peter', '22', '2Pet'],\n", " '23-1john': ['I_John', '23', '1John'],\n", " '24-2john': ['II_John', '24', '2John'],\n", " '25-3john': ['III_John', '25', '3John'], \n", " '26-jude': ['Jude', '26', 'Jude'],\n", " '27-revelation': ['Revelation', '27', 'Rev']}\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.4 - Add parent info to each node of the XML tree\n", "##### [Back to TOC](#TOC)\n", "\n", "In order to be able to traverse from the 'leafs' upto the root of the tree, it is required to add information to each node pointing to the parent of each node. 
The terminating nodes of an XML tree are called \"leaf nodes\" or \"leaves.\" These nodes do not have any child elements and are located at the end of a branch in the XML tree. Leaf nodes contain the actual data or content within an XML document. In contrast, non-leaf nodes are called \"internal nodes,\" which have one or more child elements.\n", "\n", "(Attribution: the concept of the following functions is taken from https://stackoverflow.com/questions/2170610/access-elementtree-node-parent-node)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def addParentInfo(et):\n", "    # store a reference to the parent element in each child's attrib dict\n", "    for child in et:\n", "        child.attrib['parent'] = et\n", "        addParentInfo(child)\n", "\n", "def getParent(et):\n", "    # return the parent element, or None for the root\n", "    if 'parent' in et.attrib:\n", "        return et.attrib['parent']\n", "    else:\n", "        return None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.5 - Process the XML data and store dataframe in pickle\n", "##### [Back to TOC](#TOC)\n", "\n", "This code processes the books in canonical order. First, it parses the XML and adds parent information to each node. Then it loops over the nodes and checks whether each one is a 'leaf' node, i.e. a node containing a single word. For each 'leaf' node, the following steps are performed:\n", "\n", "* Add computed data to the 'leaf' node in memory.\n", "* Traverse from the 'leaf' node up to the root, adding information from the parent, grandparent, and so on, to the 'leaf' node.\n", "* Once the root is reached, stop and store all the gathered information in a dataframe that is appended to the book's full dataframe (`full_df`).\n", "* After processing all the nodes for a specific book, export the full dataframe to a pickle file specific to that book.\n", "\n", "Note that this script takes a long time to execute (due to the large number of iterations). However, once the XML data is converted to pickle files, there is no need to rerun it (unless the source XML data is updated)."
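] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The leaf-to-root traversal described above can be sketched with the `getParent` helper from section 2.4 (`pathToRoot` is illustrative only and not part of the processing code below):\n", "\n", "```python\n", "def pathToRoot(node):\n", "    # collect the ancestor chain of a leaf node, nearest parent first\n", "    chain = []\n", "    parent = getParent(node)\n", "    while parent is not None:\n", "        chain.append(parent)\n", "        parent = getParent(parent)\n", "    return chain\n", "```"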
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing Acts at D:\\TF\\xml\\05-acts.xml\n", "......................................................................................................................................................................................." ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\tonyj\\AppData\\Local\\Temp\\ipykernel_21964\\1214332617.py:86: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n", " full_df = pd.concat([df for df in DataFrameList])\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Found 18393 items in 62.032474994659424 seconds\n", "\n" ] } ], "source": [ "# set some globals\n", "WordOrder=1 # stores the word order as it is found in the XML files (unique number for each word in the full corpus)\n", "CollectedItems= 0\n", "\n", "# process books in order\n", "for bo, bookinfo in bo2book.items():\n", " CollectedItems=0\n", " SentenceNumber=0\n", " WordGroupNumber=0\n", " full_df=pd.DataFrame({})\n", " book_long=bookinfo[0]\n", " booknum=bookinfo[1]\n", " book_short=bookinfo[2]\n", " InputFile = os.path.join(XmlDir, f'{bo}.xml')\n", " OutputFile = os.path.join(PklDir, f'{bo}.pkl')\n", " print(f'Processing {book_long} at {InputFile}')\n", " DataFrameList = []\n", "\n", " # Send XML document to parsing process\n", " tree = ET.parse(InputFile)\n", " # Now add all the parent info to the nodes in the xtree [important!]\n", " addParentInfo(tree.getroot())\n", " start_time = time.time()\n", " \n", " # walk over all the XML data\n", " for elem in tree.iter():\n", " if elem.tag == 'sentence':\n", " # add running 
number to 'sentence' tags\n", " SentenceNumber+=1\n", " elem.set('SN', SentenceNumber)\n", " # handling conditions where XML data has