{ "cells": [ { "cell_type": "markdown", "id": "1cf27c95-0b45-4d97-a62d-9950654eb386", "metadata": {}, "source": [ "# Identify punctuations (Nestle1904GBI)" ] }, { "cell_type": "markdown", "id": "1495a021-daa1-4c2e-80d5-ab7d2d75bc3f", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "## Table of contents \n", "* 1 - Introduction\n", "* 2 - Load Text-Fabric app and data\n", "* 3 - Performing the queries\n", "    * 3.1 - Frequency of punctuations in corpus\n", "    * 3.2 - Explanation of the Regular Expression\n", "    * 3.3 - Notes" ] }, { "cell_type": "markdown", "id": "e6830070-1e97-4bdf-aa0c-5eda4e624a84", "metadata": {}, "source": [ "# 1 - Introduction \n", "\n", "This Jupyter Notebook analyses the punctuation marks used in the corpus and how often each one occurs." ] }, { "cell_type": "markdown", "id": "a1b900e2-995f-4f36-ad74-d821092ca02c", "metadata": {}, "source": [ "# 2 - Load Text-Fabric app and data \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "code", "execution_count": 2, "id": "6bd6c621-361d-487f-a8df-c27fb1ec9de2", "metadata": { "tags": [] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 3, "id": "0071a0db-916c-4357-88bd-6b3255af0764", "metadata": {}, "outputs": [], "source": [ "# Loading the Text-Fabric code\n", "# Note: it is assumed Text-Fabric is installed in your environment.\n", "from tf.fabric import Fabric\n", "from tf.app import use" ] }, { "cell_type": "code", "execution_count": 4, "id": "ed76db5d-5463-4bf1-99ca-7f14b3a0f277", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Status: latest release online 0.2 versus v0.1.1 locally" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "downloading app, main data and requested 
additions ..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The requested data is not available offline\n", "\t~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.3 not found\n" ] }, { "data": { "text/html": [ "Status: latest release online 0.2 versus None locally" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "downloading app, main data and requested additions ..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.3" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " Text-Fabric: Text-Fabric API 11.4.10, tonyjurg/Nestle1904GBI/app v3, Search Reference
\n", " Data: tonyjurg - Nestle1904GBI 0.3, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name | # of nodes | # slots/node | % coverage
book | 27 | 5102.93 | 100
chapter | 260 | 529.92 | 100
sentence | 5720 | 24.09 | 100
verse | 7944 | 17.34 | 100
clause | 16124 | 8.54 | 100
phrase | 73547 | 1.87 | 100
word | 137779 | 1.00 | 100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Nestle 1904 (GBI Nodes)\n", "
\n", "\n", "
\n", "
\n", "after\n", "
\n", "
str
\n", "\n", " Chararcter after the word (space or punctuation)\n", "\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " Book name (fully spelled out)\n", "\n", "
\n", "\n", "
\n", "
\n", "booknum\n", "
\n", "
int
\n", "\n", " NT book number (Matthew=1, Mark=2, ..., Revelation=27)\n", "\n", "
\n", "\n", "
\n", "
\n", "bookshort\n", "
\n", "
str
\n", "\n", " Book name (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "case\n", "
\n", "
str
\n", "\n", " Gramatical case (Nominative, Genitive, Dative, Accusative, Vocative)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " Chapter number inside book\n", "\n", "
\n", "\n", "
\n", "
\n", "clause\n", "
\n", "
int
\n", "\n", " Clause number (counted per chapter)\n", "\n", "
\n", "\n", "
\n", "
\n", "clauserule\n", "
\n", "
str
\n", "\n", " Clause rule\n", "\n", "
\n", "\n", "
\n", "
\n", "clausetype\n", "
\n", "
str
\n", "\n", " Clause type\n", "\n", "
\n", "\n", "
\n", "
\n", "degree\n", "
\n", "
str
\n", "\n", " Degree (e.g. Comparitative, Superlative)\n", "\n", "
\n", "\n", "
\n", "
\n", "formaltag\n", "
\n", "
str
\n", "\n", " Formal tag (Sandborg-Petersen morphology)\n", "\n", "
\n", "\n", "
\n", "
\n", "functionaltag\n", "
\n", "
str
\n", "\n", " Functional tag (Sandborg-Petersen morphology)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss_EN\n", "
\n", "
str
\n", "\n", " English gloss\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " Gramatical gender (Masculine, Feminine, Neuter)\n", "\n", "
\n", "\n", "
\n", "
\n", "lemma\n", "
\n", "
str
\n", "\n", " Lexeme (lemma)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_dom\n", "
\n", "
str
\n", "\n", " Lexical domain according to Semantic Dictionary of Biblical Greek, SDBG (not present everywhere?)\n", "\n", "
\n", "\n", "
\n", "
\n", "ln\n", "
\n", "
str
\n", "\n", " Lauw-Nida lexical classification (not present everywhere)\n", "\n", "
\n", "\n", "
\n", "
\n", "monad\n", "
\n", "
int
\n", "\n", " Sequence number of the smallest meaningful unit of text (single word)\n", "\n", "
\n", "\n", "
\n", "
\n", "mood\n", "
\n", "
str
\n", "\n", " Gramatical mood of the verb (passive, etc)\n", "\n", "
\n", "\n", "
\n", "
\n", "nodeID\n", "
\n", "
str
\n", "\n", " Node ID (as in the XML source data)\n", "\n", "
\n", "\n", "
\n", "
\n", "normalized\n", "
\n", "
str
\n", "\n", " Surface word stripped of punctations\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " Gramatical number (Singular, Plural)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
str
\n", "\n", " Gramatical number of the verb\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "person\n", "
\n", "
str
\n", "\n", " Gramatical person of the verb (first, second, third)\n", "\n", "
\n", "\n", "
\n", "
\n", "phrase\n", "
\n", "
int
\n", "\n", " Phrase number (counted per chapter)\n", "\n", "
\n", "\n", "
\n", "
\n", "phrasefunction\n", "
\n", "
str
\n", "\n", " Phrase function (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "phrasefunctionlong\n", "
\n", "
str
\n", "\n", " Phrase function (long description)\n", "\n", "
\n", "\n", "
\n", "
\n", "phrasetype\n", "
\n", "
str
\n", "\n", " Phrase type information\n", "\n", "
\n", "\n", "
\n", "
\n", "sentence\n", "
\n", "
int
\n", "\n", " Sentence number (counted per chapter)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " Speech Part (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "splong\n", "
\n", "
str
\n", "\n", " Speech Part (long description)\n", "\n", "
\n", "\n", "
\n", "
\n", "strongs\n", "
\n", "
str
\n", "\n", " Strongs number\n", "\n", "
\n", "\n", "
\n", "
\n", "subj_ref\n", "
\n", "
str
\n", "\n", " Subject reference (to nodeID in XML source data)\n", "\n", "
\n", "\n", "
\n", "
\n", "tense\n", "
\n", "
str
\n", "\n", " Gramatical tense of the verb (e.g. Present, Aorist)\n", "\n", "
\n", "\n", "
\n", "
\n", "type\n", "
\n", "
str
\n", "\n", " Gramatical type of noun or pronoun (e.g. Common, Personal)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " Verse number inside chapter\n", "\n", "
\n", "\n", "
\n", "
\n", "voice\n", "
\n", "
str
\n", "\n", " Gramatical voice of the verb\n", "\n", "
\n", "\n", "
\n", "
\n", "word\n", "
\n", "
str
\n", "\n", " Word as it appears in the text\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Text-Fabric API: names N F E L T S C TF directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# load the app and data\n", "N1904 = use (\"tonyjurg/Nestle1904GBI:latest\", hoist=globals())" ] }, { "cell_type": "markdown", "id": "58ef1678-a19d-4c0c-80f3-84f8471a90e2", "metadata": { "tags": [] }, "source": [ "# 3 - Performing the queries " ] }, { "cell_type": "markdown", "id": "211b2bde-002b-4243-87c9-4bd850868354", "metadata": { "jupyter": { "outputs_hidden": true }, "tags": [] }, "source": [ "## 3.1 - Frequency of punctuations in corpus \n", "##### [Back to TOC](#TOC)\n", "\n", "This code generates a table that displays the frequency of punctuations behind words within the Text-Fabric corpus. The API call C.characters.data retrieves the data in the form of a Python dictionary. The subsequent code unpacks and sorts this dictionary to present the table. It's important to note that since the query is based on the 'word' feature, there are no spaces behind the words." ] }, { "cell_type": "code", "execution_count": 5, "id": "b8e8ce2d-43db-48dd-ace9-2156c7046692", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.12s 18507 results\n", "╒═══════════════╤═════════════╕\n", "│ Punctuation │ Frequency │\n", "╞═══════════════╪═════════════╡\n", "│ . 
│ 5712 │\n", "├───────────────┼─────────────┤\n", "│ , │ 9441 │\n", "├───────────────┼─────────────┤\n", "│ · │ 2355 │\n", "├───────────────┼─────────────┤\n", "│ ; │ 969 │\n", "├───────────────┼─────────────┤\n", "│ — │ 30 │\n", "╘═══════════════╧═════════════╛\n" ] } ], "source": [ "# Library to format table\n", "from tabulate import tabulate\n", "\n", "# The actual query (see section 3.2 for the regular expression used here).\n", "# A raw string keeps the backslash in the pattern intact.\n", "SearchPunctuations = r'''\n", "word word~([\\.·—,;])$\n", "'''\n", "PunctuationList = N1904.search(SearchPunctuations)\n", "\n", "# Count how often each punctuation mark ends a word\n", "ResultDict = {}\n", "for result in PunctuationList:\n", "    node = result[0]\n", "    # The last character of the word is the punctuation mark\n", "    Punctuation = F.word.v(node)[-1]\n", "    if Punctuation in ResultDict:\n", "        # Already seen: increment the count\n", "        ResultDict[Punctuation] += 1\n", "    else:\n", "        # First occurrence: initialize the count\n", "        ResultDict[Punctuation] = 1\n", "\n", "# Convert the dictionary into a list of key-value pairs\n", "TableData = [[key, value] for key, value in ResultDict.items()]\n", "\n", "# Produce the table\n", "headers = [\"Punctuation\", \"Frequency\"]\n", "print(tabulate(TableData, headers=headers, tablefmt='fancy_grid'))\n" ] }, { "cell_type": "markdown", "id": "1c71a7cd", "metadata": {}, "source": [ "## 3.2 - Explanation of the Regular Expression \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "markdown", "id": "01d81446", "metadata": {}, "source": [ "The regular expression `[\\.·—,;]$` matches any one of the characters `.`, `·`, `—`, `,` or `;`, and the `$` anchor requires that character to be at the end of the string. The expression therefore matches only word nodes whose last character is one of these punctuation marks. Without the `$` anchor the query would also return false positives, because 16 word nodes start with the character `—`. 
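The effect of the `$` anchor can be checked outside Text-Fabric with Python's `re` module. The sketch below is only an illustration: it assumes the search template applies the pattern the way `re.search` does, and the sample word forms and the helper name `ends_with_punctuation` are made up for the example rather than taken from the corpus.

```python
import re

# Same character class as in the query; inside [...] the dot needs no escape.
ANCHORED = re.compile('[.·—,;]$')
UNANCHORED = re.compile('[.·—,;]')


def ends_with_punctuation(word):
    '''Return True if the last character of word is one of . · — , ;'''
    return ANCHORED.search(word) is not None


# Hypothetical word forms (not taken from the corpus)
samples = ['λόγος', 'λόγος,', 'θεός·', '—καὶ', 'ἦν.']

for word in samples:
    # Columns: word, match with the $ anchor, match without the $ anchor
    print(word, ends_with_punctuation(word), UNANCHORED.search(word) is not None)
```

Only the form with a leading `—` is reported by the unanchored pattern but not by the anchored one, which is the kind of false positive the anchor is meant to avoid.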
" ] }, { "cell_type": "markdown", "id": "009becf6", "metadata": {}, "source": [ "## 3.3 Note \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "markdown", "id": "af76e60d", "metadata": {}, "source": [ "Starting from version 0.3, thi Text-Fabric dataset will include a new feature called 'after'. This feature aims to enhance the presentation of the data by providing information about the punctuations that come after a particular word." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }