{ "cells": [ { "cell_type": "markdown", "id": "1cf27c95-0b45-4d97-a62d-9950654eb386", "metadata": {}, "source": [ "# Identify punctuations (Nestle1904LFT)" ] }, { "cell_type": "markdown", "id": "1495a021-daa1-4c2e-80d5-ab7d2d75bc3f", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "## Table of contents \n", "* 1 - Introduction\n", "* 2 - Load Text-Fabric app and data\n", "* 3 - Performing the queries\n", "    * 3.1 - Frequency of punctuations in corpus\n", "    * 3.2 - Explanation of the Regular Expression" ] }, { "cell_type": "markdown", "id": "e6830070-1e97-4bdf-aa0c-5eda4e624a84", "metadata": {}, "source": [ "# 1 - Introduction \n", "\n", "This Jupyter Notebook analyzes the punctuation marks used in the corpus and counts how often each one occurs." ] }, { "cell_type": "markdown", "id": "a1b900e2-995f-4f36-ad74-d821092ca02c", "metadata": {}, "source": [ "# 2 - Load Text-Fabric app and data \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "code", "execution_count": 1, "id": "6bd6c621-361d-487f-a8df-c27fb1ec9de2", "metadata": { "tags": [] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "id": "0071a0db-916c-4357-88bd-6b3255af0764", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Loading the Text-Fabric code\n", "# Note: it is assumed Text-Fabric is installed in your environment.\n", "from tf.fabric import Fabric\n", "from tf.app import use" ] }, { "cell_type": "code", "execution_count": 3, "id": "ed76db5d-5463-4bf1-99ca-7f14b3a0f277", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/markdown": [ "**Locating corpus resources ...**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Status: latest release online v0.5 versus v03 locally" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "downloading app, main data and requested additions 
..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "app: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/app" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " | 0.21s T otype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 2.46s T oslots from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.61s T unicode from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.48s T verse from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.50s T chapter from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.57s T wordtranslit from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.60s T word from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.59s T normalized from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.58s T wordunacc from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.50s T book from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.50s T after from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | | 0.06s C __levels__ from otype, oslots, otext\n", " | | 1.83s C __order__ from otype, oslots, __levels__\n", " | | 0.07s C __rank__ from otype, __order__\n", " | | 3.93s C __levUp__ from otype, oslots, __rank__\n", " | | 2.16s C __levDown__ from otype, __levUp__, __rank__\n", " | | 0.21s C __characters__ from otext\n", " | | 0.94s C __boundary__ from otype, oslots, __rank__\n", " | | 0.04s C __sections__ from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse\n", " | | 0.23s C __structure__ from otype, oslots, otext, __rank__, __levUp__, book, chapter, verse\n", " | 0.36s T 
appos from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.43s T booknumber from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.49s T bookshort from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.49s T case from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.34s T clausetype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.55s T containedclause from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.41s T degree from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.56s T gloss from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.46s T gn from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.35s T junction from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.56s T lemma from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.51s T lex_dom from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.53s T ln from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.41s T markafter from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.41s T markbefore from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.41s T markorder from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.45s T monad from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.43s T mood from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.53s T morph from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.53s T nodeID from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.49s T nu from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.50s T number from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.46s T orig_order from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.45s T person from 
~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.44s T punctuation from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.67s T ref from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.52s T roleclausedistance from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.45s T sentence from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.52s T sp from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.50s T sp_full from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.54s T strongs from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.43s T subj_ref from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.43s T tense from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.45s T type from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.43s T voice from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.40s T wgclass from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.36s T wglevel from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.38s T wgnum from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.37s T wgrole from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.37s T wgrolelong from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.41s T wgrule from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.35s T wgtype from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.51s T wordlevel from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.49s T wordrole from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n", " | 0.50s T wordrolelong from ~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.5\n" ] }, { "data": { "text/html": [ "\n", " TF: TF API 12.1.5, tonyjurg/Nestle1904LFT/app v3, Search Reference
\n", " Data: tonyjurg - Nestle1904LFT 0.5, Character table, Feature docs
\n", "
Node types\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "
Name# of nodes# slots / node% coverage
book275102.93100
chapter260529.92100
verse794317.35100
sentence801117.20100
wg1134477.58624
word1377791.00100
\n", " Sets: no custom sets
\n", " Features:
\n", "
Nestle 1904 (Low Fat Tree)\n", "
\n", "\n", "
\n", "
\n", "after\n", "
\n", "
str
\n", "\n", " Characters (eg. punctuations) following the word\n", "\n", "
\n", "\n", "
\n", "
\n", "appos\n", "
\n", "
str
\n", "\n", " Apposition details\n", "\n", "
\n", "\n", "
\n", "
\n", "book\n", "
\n", "
str
\n", "\n", " Book name\n", "\n", "
\n", "\n", "
\n", "
\n", "booknumber\n", "
\n", "
int
\n", "\n", " NT book number (Matthew=1, Mark=2, ..., Revelation=27)\n", "\n", "
\n", "\n", "
\n", "
\n", "bookshort\n", "
\n", "
str
\n", "\n", " Book name (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "case\n", "
\n", "
str
\n", "\n", " Gramatical case (Nominative, Genitive, Dative, Accusative, Vocative)\n", "\n", "
\n", "\n", "
\n", "
\n", "chapter\n", "
\n", "
int
\n", "\n", " Chapter number inside book\n", "\n", "
\n", "\n", "
\n", "
\n", "clausetype\n", "
\n", "
str
\n", "\n", " Clause type details\n", "\n", "
\n", "\n", "
\n", "
\n", "containedclause\n", "
\n", "
str
\n", "\n", " Contained clause (WG number)\n", "\n", "
\n", "\n", "
\n", "
\n", "degree\n", "
\n", "
str
\n", "\n", " Degree (e.g. Comparitative, Superlative)\n", "\n", "
\n", "\n", "
\n", "
\n", "gloss\n", "
\n", "
str
\n", "\n", " English gloss\n", "\n", "
\n", "\n", "
\n", "
\n", "gn\n", "
\n", "
str
\n", "\n", " Gramatical gender (Masculine, Feminine, Neuter)\n", "\n", "
\n", "\n", "
\n", "
\n", "junction\n", "
\n", "
str
\n", "\n", " Junction data related to a wordgroup\n", "\n", "
\n", "\n", "
\n", "
\n", "lemma\n", "
\n", "
str
\n", "\n", " Lexeme (lemma)\n", "\n", "
\n", "\n", "
\n", "
\n", "lex_dom\n", "
\n", "
str
\n", "\n", " Lexical domain according to Semantic Dictionary of Biblical Greek, SDBG (not present everywhere?)\n", "\n", "
\n", "\n", "
\n", "
\n", "ln\n", "
\n", "
str
\n", "\n", " Lauw-Nida lexical classification (not present everywhere?)\n", "\n", "
\n", "\n", "
\n", "
\n", "markafter\n", "
\n", "
str
\n", "\n", " Text critical marker after word\n", "\n", "
\n", "\n", "
\n", "
\n", "markbefore\n", "
\n", "
str
\n", "\n", " Text critical marker before word\n", "\n", "
\n", "\n", "
\n", "
\n", "markorder\n", "
\n", "
str
\n", "\n", " Order of punctuation and text critical marker\n", "\n", "
\n", "\n", "
\n", "
\n", "monad\n", "
\n", "
int
\n", "\n", " Monad (word order in the corpus)\n", "\n", "
\n", "\n", "
\n", "
\n", "mood\n", "
\n", "
str
\n", "\n", " Gramatical mood of the verb (passive, etc)\n", "\n", "
\n", "\n", "
\n", "
\n", "morph\n", "
\n", "
str
\n", "\n", " Morphological tag (Sandborg-Petersen morphology)\n", "\n", "
\n", "\n", "
\n", "
\n", "nodeID\n", "
\n", "
str
\n", "\n", " Node ID (as in the XML source data, not yet post-processes)\n", "\n", "
\n", "\n", "
\n", "
\n", "normalized\n", "
\n", "
str
\n", "\n", " Surface word with accents normalized and trailing punctuations removed\n", "\n", "
\n", "\n", "
\n", "
\n", "nu\n", "
\n", "
str
\n", "\n", " Gramatical number (Singular, Plural)\n", "\n", "
\n", "\n", "
\n", "
\n", "number\n", "
\n", "
str
\n", "\n", " Gramatical number of the verb\n", "\n", "
\n", "\n", "
\n", "
\n", "orig_order\n", "
\n", "
int
\n", "\n", " Word order (in source XML file)\n", "\n", "
\n", "\n", "
\n", "
\n", "otype\n", "
\n", "
str
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "person\n", "
\n", "
str
\n", "\n", " Gramatical person of the verb (first, second, third)\n", "\n", "
\n", "\n", "
\n", "
\n", "punctuation\n", "
\n", "
str
\n", "\n", " Punctuation after word\n", "\n", "
\n", "\n", "
\n", "
\n", "ref\n", "
\n", "
str
\n", "\n", " ref ID\n", "\n", "
\n", "\n", "
\n", "
\n", "roleclausedistance\n", "
\n", "
str
\n", "\n", " Distance to wordgroup defining the role of this word\n", "\n", "
\n", "\n", "
\n", "
\n", "sentence\n", "
\n", "
int
\n", "\n", " Sentence number (counted per chapter)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp\n", "
\n", "
str
\n", "\n", " Part of Speech (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "sp_full\n", "
\n", "
str
\n", "\n", " Part of Speech (long description)\n", "\n", "
\n", "\n", "
\n", "
\n", "strongs\n", "
\n", "
str
\n", "\n", " Strongs number\n", "\n", "
\n", "\n", "
\n", "
\n", "subj_ref\n", "
\n", "
str
\n", "\n", " Subject reference (to nodeID in XML source data, not yet post-processes)\n", "\n", "
\n", "\n", "
\n", "
\n", "tense\n", "
\n", "
str
\n", "\n", " Gramatical tense of the verb (e.g. Present, Aorist)\n", "\n", "
\n", "\n", "
\n", "
\n", "type\n", "
\n", "
str
\n", "\n", " Gramatical type of noun or pronoun (e.g. Common, Personal)\n", "\n", "
\n", "\n", "
\n", "
\n", "unicode\n", "
\n", "
str
\n", "\n", " Word as it arears in the text in Unicode (incl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "verse\n", "
\n", "
int
\n", "\n", " Verse number inside chapter\n", "\n", "
\n", "\n", "
\n", "
\n", "voice\n", "
\n", "
str
\n", "\n", " Gramatical voice of the verb\n", "\n", "
\n", "\n", "
\n", "
\n", "wgclass\n", "
\n", "
str
\n", "\n", " Class of the wordgroup ()\n", "\n", "
\n", "\n", "
\n", "
\n", "wglevel\n", "
\n", "
int
\n", "\n", " Number of parent wordgroups for a wordgroup\n", "\n", "
\n", "\n", "
\n", "
\n", "wgnum\n", "
\n", "
int
\n", "\n", " Wordgroup number (counted per book)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgrole\n", "
\n", "
str
\n", "\n", " Role of the wordgroup (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgrolelong\n", "
\n", "
str
\n", "\n", " Role of the wordgroup (full)\n", "\n", "
\n", "\n", "
\n", "
\n", "wgrule\n", "
\n", "
str
\n", "\n", " Wordgroup rule information\n", "\n", "
\n", "\n", "
\n", "
\n", "wgtype\n", "
\n", "
str
\n", "\n", " Wordgroup type details\n", "\n", "
\n", "\n", "
\n", "
\n", "word\n", "
\n", "
str
\n", "\n", " Word as it appears in the text (excl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordlevel\n", "
\n", "
str
\n", "\n", " Number of parent wordgroups for a word\n", "\n", "
\n", "\n", "
\n", "
\n", "wordrole\n", "
\n", "
str
\n", "\n", " Role of the word (abbreviated)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordrolelong\n", "
\n", "
str
\n", "\n", " Role of the word (full)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordtranslit\n", "
\n", "
str
\n", "\n", " Transliteration of the text (in latin letters, excl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "wordunacc\n", "
\n", "
str
\n", "\n", " Word without accents (excl. punctuations)\n", "\n", "
\n", "\n", "
\n", "
\n", "oslots\n", "
\n", "
none
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n", " Settings:
specified
  1. apiVersion: 3
  2. appName: tonyjurg/Nestle1904LFT
  3. appPath:C:/Users/tonyj/text-fabric-data/github/tonyjurg/Nestle1904LFT/app
  4. commit: f2eb5e2b0f8805ad720d91a5cb9e2aa2fdc6c99a
  5. css: ''
  6. dataDisplay:
    • excludedFeatures: [reference]
    • noneValues:
      • none
      • unknown
      • no value
      • NA
      • ''
  7. interfaceDefaults: {fmt: layout-orig-full}
  8. isCompatible: True
  9. local: no value
  10. localDir:C:/Users/tonyj/text-fabric-data/github/tonyjurg/Nestle1904LFT/_temp
  11. provenanceSpec:
    • corpus: Nestle 1904 (Low Fat Tree)
    • org: tonyjurg
    • relative: /tf
    • repo: Nestle1904LFT
    • repro: Nestle1904LFT
    • version: 0.5
  12. release: v03
  13. showVerseInTuple: 0
  14. typeDisplay:
    • book:
      • label: {book}
      • style: ''
    • chapter:
      • label: {chapter}
      • style: ''
    • sentence:
      • hidden: 0
      • label: {sentence}
      • style: ''
    • verse:
      • label: {verse}
      • style: ''
    • wg:
      • hidden: 0
      • label: {rule} {clausetype} {wgrolelong} {junction}
      • style: ''
    • word:
      • base: True
      • features:
        • lemma
        • strongs
      • featuresBare: [gloss]
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "App config error(s) in wg:\n", "\tlabel: feature rule not loaded\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
TF API: names N F E L T S C TF Fs Fall Es Eall Cs Call directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# load the app and data\n", "N1904 = use (\"tonyjurg/Nestle1904LFT:latest\", hoist=globals())" ] }, { "cell_type": "markdown", "id": "58ef1678-a19d-4c0c-80f3-84f8471a90e2", "metadata": { "tags": [] }, "source": [ "# 3 - Performing the queries " ] }, { "cell_type": "markdown", "id": "211b2bde-002b-4243-87c9-4bd850868354", "metadata": { "jupyter": { "outputs_hidden": true }, "tags": [] }, "source": [ "## 3.1 - Frequency of punctuations in corpus \n", "##### [Back to TOC](#TOC)\n", "\n", "This code generates a table that displays the frequency of punctuations behind words within the Text-Fabric corpus. The API call C.characters.data retrieves the data in the form of a Python dictionary. The subsequent code unpacks and sorts this dictionary to present the table. It's important to note that since the query is based on the 'word' feature, there are no spaces behind the words." ] }, { "cell_type": "code", "execution_count": 5, "id": "b8e8ce2d-43db-48dd-ace9-2156c7046692", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.12s 18507 results\n", "╒═══════════════╤═════════════╕\n", "│ Punctuation │ Frequency │\n", "╞═══════════════╪═════════════╡\n", "│ . 
│ 5712 │\n", "├───────────────┼─────────────┤\n", "│ , │ 9441 │\n", "├───────────────┼─────────────┤\n", "│ · │ 2355 │\n", "├───────────────┼─────────────┤\n", "│ ; │ 969 │\n", "├───────────────┼─────────────┤\n", "│ — │ 30 │\n", "╘═══════════════╧═════════════╛\n" ] } ], "source": [ "# Library to format table\n", "from tabulate import tabulate\n", "\n", "# The actual query (see section 3.2 about the used RegExp in this query)\n", "SearchPunctuations = '''\n", "word word~([\\.·—,;])$\n", "'''\n", "PunctuationList = N1904.search(SearchPunctuations)\n", "\n", "# Count how often each punctuation mark ends a word\n", "ResultDict = {}\n", "for result in PunctuationList:\n", "    node = result[0]\n", "    # The punctuation mark is the last character of the word feature\n", "    Punctuation = F.word.v(node)[-1]\n", "    if Punctuation in ResultDict:\n", "        # Already seen: increment the existing count\n", "        ResultDict[Punctuation] += 1\n", "    else:\n", "        # First occurrence: start the count at 1\n", "        ResultDict[Punctuation] = 1\n", "\n", "# Convert the dictionary into a list of key-value pairs\n", "TableData = [[key, value] for key, value in ResultDict.items()]\n", "\n", "# Produce the table\n", "headers = [\"Punctuation\", \"Frequency\"]\n", "print(tabulate(TableData, headers=headers, tablefmt='fancy_grid'))\n" ] }, { "cell_type": "markdown", "id": "1c71a7cd", "metadata": {}, "source": [ "## 3.2 - Explanation of the Regular Expression \n", "##### [Back to TOC](#TOC)" ] }, { "cell_type": "markdown", "id": "01d81446", "metadata": {}, "source": [ "The regular expression `[\\.·—,;]$` matches any one character from the set containing `.`, `·`, `—`, `,`, or `;`. The `$` anchor requires that this character is at the end of the string, so the expression matches only when one of these characters occupies the last position of the 'word' feature of a word node. The parentheses around the character class in the query only form a group and do not change what is matched. If the `$` anchor is omitted, the query returns false positives, because the corpus contains 16 word nodes that start with the character `—`.
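
The effect of the `$` anchor can be illustrated outside Text-Fabric with Python's own `re` module. This is a minimal sketch, not part of the notebook's query: the sample strings and the two helper names are hypothetical, and it assumes the expression is applied with search semantics (a match anywhere in the value counts), which is what makes the unanchored variant match a leading `—`.

```python
import re

# Inside a character class the dot is literal, so it needs no escaping here.
ANCHORED = re.compile('[.·—,;]$')   # punctuation must be the last character
UNANCHORED = re.compile('[.·—,;]')  # punctuation may appear anywhere


def ends_with_punctuation(value):
    # True only if the final character is one of the five marks
    return ANCHORED.search(value) is not None


def contains_punctuation(value):
    # True if any of the five marks occurs at any position
    return UNANCHORED.search(value) is not None


# Hypothetical sample values, not taken from the corpus
samples = ['λόγος.', 'θεόν,', '—ἐγώ', 'καὶ']

for value in samples:
    print(value, ends_with_punctuation(value), contains_punctuation(value))
```

Only the third sample separates the two variants: it starts with `—`, so the unanchored expression accepts it while the anchored one rejects it.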
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }