{ "cells": [ { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true }, "source": [ "\n", "\n", "\n", "\n", "\n", "# Phonetic Transliteration of Hebrew Masoretic Text\n", "\n", "# Frequently asked questions\n", "\n", "Q: *What is the use of a phonetic transliteration of the Hebrew Bible? What can anyone wish beyond the careful, meticulous Masoretic system of consonants, vowels and accents?*\n", "\n", "A: Several things:\n", "\n", "* the Hebrew Bible may be subject of study in various fields,\n", " where the people involved do not master the Hebrew script;\n", " a phonetic transcription removes a hurdle for them.\n", "* in computational linguistics there are many tools that deal with written language in Latin alphabets;\n", " even a simple task as getting the consonant-vowel pattern of a word is unnecessarily complicated\n", " when using the Hebrew script.\n", "* in phonetics and language learning theory, it is important to represent the sounds without being burdened\n", " by the idiosyncracies of the writing system and the spelling.\n", "\n", "Q: *But surely, there already exist transliterations of Hebrew? Why not use them?*\n", "\n", "Here are a few pragmatic reasons:\n", "\n", "* we want to be able to *compute* a transliteration based upon our own data;\n", "* we want to gain insight in to what extent the transliteration can be purely rule-based, and to what extent\n", " it depends on lexical information that you just need to know;\n", "* we want to make available a well documented transliteration, that can be studied, borrowed and improved by others.\n", "\n", "Q: *But how **good** is your transliteration?*\n", "\n", "we do not know, ..., yet. A few remarks though:\n", "\n", "* we have applied most of the *rules* that we could find in Hebrew grammars;\n", "* we have suspended some of the rules for some verb paradigms where it is known that they lead to incorrect results\n", "* where the rules did not suffice, we have searched the corpus for other occurrences of the same word, to get clues;\n", "* where we knew that clues pointed in the wrong direction, we have applied a list of exceptions (currently a list of only the word בָּתִּֽים (\\*bottˈîm => bāttˈîm)\n", "* we have a fair test set with critical cases that all pass\n", "* we have a few tables of all cases where the algorithm has made corpus based decisions and lexical decisions\n", "* we are open for your corrections: login into [SHEBANQ](https://shebanq.ancient-data.org), go to a passage with offending phonetic transliteration, and make a manual note. **Tip:** Give that note the keyword ``phono``, then we\n", " will collect them.\n", "\n", "Q: *To me, this is not entirely satisfying.*\n", "\n", "A: Fair enough. Consider jumping to [Bible Online Learner](http://bibleol.3bmoodle.dk/text/show_text),\n", "where they have built in a pretty good transliteration, based on a different method of rule application. 
It is documented in an article by Nicolai Winther-Nielsen:\n", "[Transliteration of Biblical Hebrew for the Role-Lexical Module](http://www.see-j.net/index.php/hiphil/article/view/62)\n", "and additional information can be found in Claus Tøndering's\n", "[Bible Online Learner, Software on GitHub](https://github.com/EzerIT/BibleOL).\n", "See also [Lex: A software project for linguists](http://www.see-j.net/index.php/hiphil/article/view/60/56).\n", "\n", "We are planning to conduct an automatic comparison of both transliteration schemes over the whole corpus.\n", "\n", "Q: *Who is the **we**?*\n", "\n", "That is the author of this notebook, [Dirk Roorda](mailto:dirk.roorda@dans.knaw.nl), working together with Martijn Naaijer and getting input from Nicolai Winther-Nielsen and Willem van Peursen." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview of the results\n", "\n", "1. The main result is a python function ``phono(``*ETCBC-original*``, ...): ``*phonetic transliteration*.\n", "1. Showcases and tests: how the function solves particular classes of problems.\n", " The *cases* file shows a set of cases that have been generated in the last run.\n", "\n", "The *tests* files show a prepared set of cases, against which to test new versions of the algorithm. These results have been obtained on version `c` of the\n", "[BHSA dataset](https://etcbc.github.io/bhsa).\n", " 1. [mixed](mixedc.html)\n", " with log file\n", " [mixed_debug](mixed_debugc.txt).\n", " 1. [qamets-non-verb cases](qamets_nonverb_casesc.html)\n", " and\n", " [qamets-non-verb tests](qamets_nonverb_testsc.html)\n", " with log file\n", " [qamets-nonverb_tests_debug](qamets_nonverb_tests_debugc.txt).\n", " The result of searching the corpus for related occurrences and\n", " having them vote for qatan/gadol interpretation of the qamets.\n", " 1. [qamets-verb cases](qamets_verb_casesc.html)\n", " and\n", " [qamets-verb tests](qamets_verb_testsc.html)\n", " with log file\n", " [qamets-verb tests-debug](qamets_verb_tests_debugc.txt).\n", " The result of suppressing the qatan interpretation of the qamets regardless of accent\n", " for a definite set of *verb forms*.\n", " 1. [qamets-prs cases](qamets_prs_casesc.html)\n", " and\n", " [qamets-prs tests](qamets_prs_testsc.html)\n", " with log file\n", " [qamets-prs tests-debug](qamets_prs_tests_debugc.txt).\n", " The result of suppressing the qatan interpretation of the qamets in *pronominal suffixes*.\n", "1. A [plain text](combi.txt) with the complete text in BHSA transliteration and phonetic transcription,\n", " verse by verse." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview of the method\n", "\n", "## High-level description\n", "\n", "1. **BHSA transliteration**\n", " Our starting point is the BHSA full transliteration of the Hebrew Masoretic text.\n", " This transliteration is in 1-1 correspondence with the Masoretic text, including all vowels and accents.\n", "1. **Grammar rules**\n", " We have implemented the rules we find in grammars of Hebrew about long and short qamets, mobile and silent schwa,\n", " dagesh, and mater lectionis.\n", " The implementation takes the form of a row of *regular expressions*,\n", " where we transliterate targeted pieces of the original.\n", " These regular expressions are exquisitely formulated, and must be applied in the given order.\n", " *Beware:* Seemingly innocent modifications in these expressions or in the order of application,\n", " may ruin the transcription completely.\n", "1. 
**Qamets puzzles: verbs**\n",
 " In many verb forms the grammar rules would dictate that a certain qamets is qatan while in fact it is gadol.\n",
 " In most cases this is caused by the fact that no accent has been marked on the syllable that carries the\n",
 " qamets in question. There is a limited set of verb paradigms where this occurs.\n",
 " We detect those and suppress the qamets qatan interpretation for them.\n",
 "1. **Qamets puzzles: non-verbs**\n",
 " There are quite a few non-verb occurrences where the accent pattern of a word invites a qamets to become\n",
 " qatan, that is, by the grammar rules.\n",
 " Yet other occurrences of the same lexeme have other accent patterns, and\n",
 " lead to a gadol interpretation of the same qamets.\n",
 " In this case we count the unique cases in favor of gadol versus qatan, and let the majority decide for all\n",
 " occurrences. In cases where we know that the majority votes wrong, we have intervened.\n",
 "\n",
 "### Qamets working hypothesis\n",
 "Note that in the *non-verb qamets puzzles* we have tacitly made the assumption that qamets qatan and gadol are not phonological variants of each other.\n",
 "In other words, it never occurs that a qamets gadol becomes shortened into a qamets qatan.\n",
 "From the grammar rules it follows that short versions of the qamets can only be\n",
 "\n",
 "* patah\n",
 "* schwa\n",
 "* composite schwa with patah\n",
 "\n",
 "and never\n",
 "\n",
 "* qamets qatan\n",
 "* composite schwa with qamets\n",
 "\n",
 "Whether this hypothesis is right is beyond my competence.\n",
 "We just use it as a working hypothesis.\n",
 "\n",
 "## Lexical information\n",
 "\n",
 "This method is not a pure method, in the sense of working only with the information given in the source string.\n",
 "We *cheat*, i.e. we use morphological information from the BHSA database to\n",
 "steer us in the right direction. To this end, the input of `phono()` is always a\n",
 "Text-Fabric node, from which we can get all the information we need.\n",
 "\n",
 "More precisely, the input is a sequence of nodes.\n",
 "This sequence is meant to correspond to a sequence of slots belonging to words that are written adjacently\n",
 "(no space between, no maqef between).\n",
 "From these nodes we can look up:\n",
 "\n",
 "* the BHSA transliteration\n",
 "* the qere (if there is a discrepancy between ketiv and qere)\n",
 "* additional lexical information (taken from the last node)\n",
 "\n",
 "## Combined words\n",
 "\n",
 "You can use `phono()` to transliterate multiple words at the same time, but you can also do individual words,\n",
 "even if in Hebrew they are written together.\n",
 "However, it is better to feed combined words to `phono()` in one go, because the prefix word may influence the transliteration of the postfix word. 
Think of the article followed by a word starting with a `BGDKPT` letter.\n",
 " The dagesh in the `BGDKPT` is interpreted as a lene if the word stands on its own, but as a forte if it is combined.\n",
 "\n",
 "However, it is not advised to feed longer strings to `phono()`, because when phono retrieves lexical information, it uses the information of the last node that matches a word in the input string.\n",
 "\n",
 "## Accents\n",
 "\n",
 "We determine \"primary\" and \"secondary\" stress in our transliteration, but this must not be taken in a phonetic sense.\n",
 "Every syllable that carries an accent pointing will get a primary stress mark.\n",
 "However, a few specific accent pointings are not deemed to produce an accent, and another group of accents\n",
 "is deemed to produce only a secondary accent.\n",
 "The last syllable of a word also gets a secondary accent by default.\n",
 "We have not yet tried to be more precise in this, so *segolates* do not get the treatment they deserve.\n",
 "\n",
 "The main rationale for accents is that they prevent a qamets from being read as qatan.\n",
 "\n",
 "## Individual symbols\n",
 "\n",
 "We have made a careful selection of UNICODE symbols to represent Hebrew sounds.\n",
 "Sometimes we follow the phonetic usage of the symbols, sometimes we follow widespread custom.\n",
 "The actual mapping can be plugged in quite easily,\n",
 "and the intermediate stages in the transformation do not use these symbols,\n",
 "so the algorithm can be easily adapted to other choices.\n",
 "\n",
 "### Consonants\n",
 "\n",
 "Provided it is not part of a long vowel, we write `י` as `y`,\n",
 "whilst `j` would be more in line with the phonetic alphabet.\n",
 "\n",
 "Likewise, we write `ו` as `w`, if it is not part of a long vowel.\n",
 "If a word ends in `יו` the `ו` is not a mater lectionis, and the `י` gets elided.\n",
 "We represent this phonetically as `ʸw`.\n",
 "\n",
 "With regard to the `BGDKPT` letters,\n",
 "it would have been attractive to use the letters `b g d k p t` without\n",
 "a diacritic for the plosive variants, and with a suitable diacritic for the fricative variants.\n",
 "Alas, the UNICODE table does not offer a suitable diacritic that is available for all six of these letters.\n",
 "\n",
 "So, we use `b g d k p t` for the plosives, but for the fricatives we use `v ḡ ḏ ḵ f ṯ`.\n",
 "\n",
 "With regard to the *emphatic* consonants ט and ח and צ, we\n",
 "represent them with a dot below: `ṭ ḥ ṣ`.\n",
 "ק is just `q`.\n",
 "\n",
 "ע and א translate to `ʕ` and `ʔ`.\n",
 "\n",
 "שׁ and שׂ translate to `š` and `ś`.\n",
 "ס is just `s`.\n",
 "\n",
 "When א and ה are mater lectionis, they are left out. A ה with mappiq becomes just `h`,\n",
 "like every ה which is not a mater lectionis.\n",
 "\n",
 "We do not mark the deviant final forms of the consonants ך and ם and ן and ף and ץ, assuming that\n",
 "this is just a scriptural peculiarity, with no effect on the actual sounds.\n",
 "\n",
 "The remaining consonants go as follows:\n",
 "\n",
 "| Hebrew | phonetic |\n",
 "| --- | --- |\n",
 "| ל | `l` |\n",
 "| מ | `m` |\n",
 "| נ | `n` |\n",
 "| ר | `r` |\n",
 "| ז | `z` |\n",
 "
\n", "\n", "### Vowels\n", "\n", "The short vowels (patah, segol, hireq, qamets qatan, qibbuts) are just `a e i o u`.\n", "\n", "However, the *furtive* patah is a `ₐ` in front of its consonant.\n", "\n", "The long vowels without yod or waw (qamets gadol, tsere, holam) have a bar above `ā ē ō`.\n", "\n", "The complex vowels (tsere or hireq plus yod, holam plus waw, waw with dagesh) have a circumflex `ê î ô û`.\n", "\n", "A segol followed by yod becomes `eʸ`\n", "\n", "The composite schwas (patah, segol, qamets) are written as superscripts `ᵃ ᵉ ᵒ`.\n", "\n", "The simple schwa is left out if silent, and otherwise it becomes `ᵊ`.\n", "\n", "### Accent\n", "\n", "The primary and secondary stress are marked as `ˈ ˌ` and are placed *in front of the vowel they occur with*.\n", "\n", "### Punctuation\n", "\n", "The sof-pasuq ׃ becomes `.`.\n", "If it is followed by ס (setumah) or ף (petuhah) or ̇׆ (nun-hafukha), these extra symbols are omitted.\n", "\n", "The maqef ־ (between words) becomes `-`.\n", "\n", "If words are juxtaposed without space in the Hebrew, they are also juxtaposed without space in the phonetic\n", "transliteration.\n", "\n", "### Tetragrammaton\n", "\n", "The tetragrammaton is transliterated with the vowels it is encountered with, but the whole is put between\n", "square brackets `[ ]`.\n", "\n", "### Ketiv-qere\n", "\n", "We base the phonetics on the (vocalized) qere, if a qere is present.\n", "The ketiv is then ignored. We precede each such word by a `*` to indicate that the qere\n", "is deviant from the ketiv. Using the data view in SHEBANQ it is possible to see what the ketiv is.\n", "\n", "## Cleaning up\n", "\n", "We leave the accents and the schwas in the end product of the `phono()` function,\n", "despite the fact that the accents, as they appear, do not have consistent phonetic significance.\n", "And it can be argued that every schwa is silent.\n", "If you do not care for schwas and accents, it is easy to remove them.\n", "Also, if you find the results in separating the qamets into qatan and gadol unsatisfying or irrelevant, you can\n", "just replace them both by a single symbol, such as `å`.\n", "\n", "## Testing\n", "\n", "Quite a bit of code is dedicated to count special cases, to test, and to produce neat tables with interesting forms.\n", "It is also possible to call the `phono()` function in debug mode, which will write to a text file all stages in the\n", "transliteration from BHSA original into the phonetic result." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# Load the modules" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "import sys\n", "import os\n", "import collections\n", "import re\n", "import yaml\n", "import utils\n", "from tf.fabric import Fabric\n", "from tf.writing.transcription import Transcription\n", "from tf.core.helpers import formatMeta" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# Pipeline\n", "See [operation](https://github.com/ETCBC/pipeline/blob/master/README.md#operation)\n", "for how to run this script in the pipeline." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "if \"SCRIPT\" not in locals():\n", " SCRIPT = False\n", " FORCE = True\n", " CORE_NAME = \"bhsa\"\n", " NAME = \"phono\"\n", " VERSION = \"2021\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def stop(good=False):\n", " if SCRIPT:\n", " sys.exit(0 if good else 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook can run a lot of tests and create a lot of examples.\n", "However, when run in the pipeline, we only want to create the two `phono` features.\n", "\n", "So, further on, there will be quite a bit of code under the condition `not SCRIPT`." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# Setting up the context: source file and target directories\n", "\n", "The conversion is executed in an environment of directories, so that sources, temp files and\n", "results are in convenient places and do not have to be shifted around." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "repoBase = os.path.expanduser(\"~/github/etcbc\")\n", "coreRepo = \"{}/{}\".format(repoBase, CORE_NAME)\n", "thisRepo = \"{}/{}\".format(repoBase, NAME)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "coreTf = \"{}/tf/{}\".format(coreRepo, VERSION)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "thisTemp = \"{}/_temp/{}\".format(thisRepo, VERSION)\n", "thisTempTf = \"{}/tf\".format(thisTemp)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "thisTf = \"{}/tf/{}\".format(thisRepo, VERSION)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# Test\n", "\n", "Check whether this conversion is needed in the first place.\n", "Only when run as a script." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "if SCRIPT:\n", " (good, work) = utils.mustRun(\n", " None, \"{}/.tf/{}.tfx\".format(thisTf, \"phono\"), force=FORCE\n", " )\n", " if not good:\n", " stop(good=False)\n", " if not work:\n", " stop(good=True)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# Load the TF data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 
0.00s Load the existing TF dataset .\n", "..............................................................................................\n", "This is Text-Fabric 9.1.7\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "114 features found and 0 ignored\n" ] } ], "source": [ "utils.caption(4, \"Load the existing TF dataset\")\n", "TF = Fabric(locations=coreTf, modules=[\"\"])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s loading features ...\n", " | 0.00s Dataset without structure sections in otext:no structure functions in the T-API\n", " 10s All features loaded/computed - for details use TF.isLoaded()\n" ] }, { "data": { "text/plain": [ "[('Computed',\n", " 'computed-data',\n", " ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),\n", " ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),\n", " ('Fabric', 'loading', ('TF',)),\n", " ('Locality', 'locality', ('L Locality',)),\n", " ('Nodes', 'navigating-nodes', ('N Nodes',)),\n", " ('Features',\n", " 'node-features',\n", " ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),\n", " ('Search', 'search', ('S Search',)),\n", " ('Text', 'text', ('T Text',))]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "api = TF.load(\n", " \"\"\"\n", " qere qere_trailer\n", " g_word_utf8 g_cons_utf8 trailer\n", " g_word g_cons lex_utf8 lex lex0\n", " sp vs vt gn nu ps st\n", " uvf prs g_prs pfm vbs vbe\n", " languageISO\n", "\"\"\"\n", ")\n", "api.makeAvailableIn(globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The source string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is what we use as our starting point: the BHSA transliteration, with one or two tweaks.\n", "\n", "The BHSA transliteration encodes also what comes after each word until the next word.\n", "Sometimes we want that extra bit, and sometimes not, and sometimes part of it." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Patterns" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# punctuation\n", "punctuation = re.compile(\n", " r\"\"\"\n", " (?: [ -]\\s*\\Z) # space, (no maqef) or nospace\n", " | (?:\n", " 0[05] # sof pasuq or paseq\n", " (?:_[SNP])* # nun hafukha, setumah, petuhah at end of verse\n", " \\s*\\Z\n", " )\n", " | (?:_[SPN]\\s*\\Z) # nun hafukha, setumah, petuhah between words\n", "\"\"\",\n", " re.X,\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "split_punctuation = re.compile(\n", " r\"\"\"\n", " (.*?) 
# part before punctuation\n", " ((?: # punctuation itself\n", " (?: [ &-]\\s*) # space, maqef, or nospace\n", " | (?:\n", " 0[05] # sof pasuq or paseq\n", " (?:_[SNP])* # nun hafukha, setumah, petuhah at end of verse\n", " \\s*\n", " )\n", " | (?:_[SPN]\\s*) # nun hafukha, setumah, petuhah between words\n", " )*)\n", "\"\"\",\n", " re.X,\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "start_punct = re.compile(\n", " r\"\"\"\n", " (?: \\A[ &-]\\s*) # space, maqef or nospace\n", " | (?:\n", " \\A\n", " 0[05] # sof pasuq or paseq\n", " (?:_[SNP])* # nun hafukha, setumah, petuhah at end of verse\n", " \\s*\n", " )\n", " | (?:\\A\\s*_[SPN]\\s*) # nun hafukha, setumah, petuhah between words\n", "\"\"\",\n", " re.X,\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "noorigspace = re.compile(\n", " r\"\"\"\n", " (?: [&-]\\Z) # space, maqef or nospace\n", " | (?:\n", " 0[05] # sof pasuq or paseq\n", " (?:_[SNP])* # nun hafukha, setumah, petuhah at end of verse\n", " \\Z\n", " )\n", " | (?:_[SPN])+ # nun hafukha, setumah, petuhah between words\n", "\"\"\",\n", " re.X,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "setumah and petuhah\n", "Usually, setumah and petuhah occur after the end of verse sign.\n", "In that case we can strip them.\n", "Sometimes they occur inter-word. Then we have to replace them by a space\n", "because the words are otherwise adjacent.\n", "This operation must be performed before originals are glued together,\n", "because the `_S` and `_P` can only be reliably detected if they are at the end of a word.\n", "So: set_pet to be used before phono(), in get_orig, but only if get_orig is\n", "used for phono()." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "set_pet_pattern = re.compile(r\"((?:0[05])?)(_[SNP])+\\Z\")\n", "tetra_lex = \"JHWH/\"" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def set_pet_pattern_repl(match):\n", " (punct, nsp) = match.groups()\n", " sep = \" § \" if punct == \"\" and nsp != \"\" else \"\"\n", " return punct + sep" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Actions" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def get_orig(w, punct=True, set_pet=False, tetra=True, give_ketiv=False):\n", " proto = F.g_word.v(w) + F.trailer.v(w)\n", " qere = F.qere.v(w)\n", " qere_trailer = F.qere_trailer.v(w)\n", " if qere_trailer == \"\":\n", " qere_trailer = \"-\"\n", " orig = proto if give_ketiv or qere is None else qere + qere_trailer\n", " if tetra and F.lex.v(w) == tetra_lex:\n", " (mat, sep) = split_punctuation.fullmatch(orig).groups()\n", " orig = \"[ \" + mat + \" ]\" + sep\n", " if not punct:\n", " orig = punctuation.sub(\"\", orig)\n", " else:\n", " # if not noorigspace.search(orig):\n", " # orig += ' '\n", " if not set_pet:\n", " orig = set_pet_pattern.sub(set_pet_pattern_repl, orig)\n", " return orig" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "find the first occurrence of the string orig in the verse (ETCBC representation)\n", "Then deliver the sequence of nodes corresponding to that sequence\n", "it turns out that too much is happening with accents, so I will \"normalize\" the accents for the\n", "sake of looking up" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], 
"source": [ "digit = re.compile(\"[0-9]+\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def find_w(passage, orig, debug=False):\n", " if len(orig) == 0:\n", " return None\n", " vn = T.nodeFromSection(passage, lang=\"la\")\n", " verse_words = L.d(vn, \"word\")\n", " results = None\n", " orig = orig.strip() + \" \"\n", " lvw = len(verse_words)\n", " for i in range(lvw):\n", " target = orig\n", " for j in range(i, lvw + 1):\n", " target = start_punct.sub(\"\", target)\n", " target = digit.sub(\"\", target)\n", " if len(target) == 0:\n", " results = verse_words[i:j]\n", " break\n", " if j >= lvw:\n", " break\n", " j_orig = digit.sub(\n", " \"\",\n", " get_orig(\n", " verse_words[j],\n", " punct=False,\n", " tetra=False,\n", " give_ketiv=True,\n", " ),\n", " ).rstrip(\"&\")\n", " if target.startswith(j_orig):\n", " if debug:\n", " TF.info(\"{}-{}: [{}] <= [{}]\".format(i, j, j_orig, target))\n", " target = target[len(j_orig) :]\n", " if debug:\n", " TF.info(\"{}-{}: [{}]\".format(i, j, target))\n", " continue\n", " if debug:\n", " TF.info(\"{}-{}: [{}] \", \"alef\", \"ʔ\"),\n", " (\"<\", \"ayin\", \"ʕ\"),\n", " (\"v\", \"tet\", \"ṭ\"),\n", " (\"y\", \"tsade\", \"ṣ\"),\n", " (\"x\", \"chet\", \"ḥ\"),\n", " (\"c\", \"shin\", \"š\"),\n", " (\"f\", \"sin\", \"ś\"),\n", " (\"#\", \"s(h)in\", \"ŝ\"),\n", " (\"ij\", \"long hireq\", \"î\"),\n", " (\"I\", \"short hireq\", \"i\"),\n", " (\";j\", \"long tsere\", \"ê\"),\n", " (\"ow\", \"long holam\", \"ô\"),\n", " (\"w.\", \"long `qibbuts`\", \"û\"),\n", " (\"ej\", \"e glide\", \"eʸ\"),\n", " (\"j\", \"yod\", \"y\"),\n", " (\":a\", \"hataf patach\", \"ᵃ\"),\n", " (\":@\", \"hataf qamats\", \"ᵒ\"),\n", " (\":e\", \"hataf segol\", \"ᵉ\"),\n", " (\"%\", \"schwa mobile\", \"ᵊ\"),\n", " (\":\", \"schwa quiescens\", \"\"),\n", " (\"@\", \"qamats gadol\", \"ā\"),\n", " (\"a\", \"patach\", \"a\"),\n", " (\"`\", \"furtive patach\", \"ₐ\"),\n", " (\"+\", \"qamats\", \"å\"),\n", " (\"e\", \"segol\", \"e\"),\n", " (\n", " \";\",\n", " \"tsere\",\n", " \"ē\",\n", " ),\n", " (\"i\", \"hireq\", \"i\"),\n", " (\"o\", \"holam\", \"ō\"),\n", " (\"^\", \"qamats qatan\", \"o\"),\n", " (\"u\", \"qibbuts\", \"u\"),\n", " (\"b.\", \"b plosive\", \"B\"),\n", " (\"g.\", \"g plosive\", \"G\"),\n", " (\"d.\", \"d plosive\", \"D\"),\n", " (\"k.\", \"k plosive\", \"K\"),\n", " (\"p.\", \"p plosive\", \"P\"),\n", " (\"t.\", \"t plosive\", \"T\"),\n", " (\"b\", \"b fricative\", \"v\"),\n", " (\"g\", \"g fricative\", \"ḡ\"),\n", " (\"d\", \"d fricative\", \"ḏ\"),\n", " (\"k\", \"k fricative\", \"ḵ\"),\n", " (\"p\", \"p fricative\", \"f\"),\n", " (\"t\", \"t fricative\", \"ṯ\"),\n", " (\"B\", \"b plosive\", \"b\"),\n", " (\"G\", \"g plosive\", \"g\"),\n", " (\"D\", \"d plosive\", \"d\"),\n", " (\"K\", \"k plosive\", \"k\"),\n", " (\"P\", \"p plosive\", \"p\"),\n", " (\"T\", \"t plosive\", \"t\"),\n", " (\"w\", \"waw\", \"w\"),\n", " (\"l\", \"lamed\", \"l\"),\n", " (\"m\", \"mem\", \"m\"),\n", " (\"n\", \"nun\", \"n\"),\n", " (\"r\", \"resh\", \"r\"),\n", " (\"z\", \"zajin\", \"z\"),\n", " (\"!\", \"primary accent\", \"ˈ\"),\n", " (\"/\", \"secundary accent\", \"ˌ\"),\n", " (\"&\", \"maqef\", \"-\"),\n", " (\"*\", \"masora\", \"*\"),\n", ")" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "specials2 = (\n", " (\"$\", \"sof pasuq\", \".\"),\n", " (\"|\", \"paseq\", \" \"),\n", " (\"§\", \"interword setumah and petuhah\", \" \"),\n", ")" ] }, { "cell_type": "markdown", 
"metadata": { "lines_to_next_cell": 2 }, "source": [ "## Assembling the symbols in dictionaries\n", "\n", "We compile the table of symbols in handy dictionaries for ease of processing later.\n", "\n", "We need to quickly detect the dagesh lenes later on, so we store them in a dictionary.\n", "\n", "Our treatment of accents is still primitive.\n", "\n", "We ignore some accents (``irrelevant accents`` below) and we consider some accents as indicators of a mere\n", "*secondary* accent (``secundary accents`` below).\n", "\n", "The ``sound_dict`` is the resulting (ordered) mapping of all source characters to \"phonetic\" characters." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "dagesh_lenes = {\"b.\", \"g.\", \"d.\", \"k.\", \"p.\", \"t.\"}\n", "dagesh_lene_dict = dict()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "irrelevant_accents = (\n", " (\"01\", \"segol\"), # occurs always with another accent\n", " (\"03\", \"pashta\"), # by definition on last syllable: not relevant for accent\n", " (\"04\", \"telisha qetana\"),\n", " (\"14\", \"telisha gedola\"),\n", " (\"24\", \"telisha qetana\"),\n", " (\"44\", \"telisha gedola\"),\n", ")\n", "secundary_accents = (\n", " (\"71\", \"merkha\"), # ??\n", " (\"63\", \"qadma\"), # ??\n", " (\"73\", \"tipeha\"), # ??\n", ")\n", "punctuation_accents = (\n", " (\"00\", \"sof pasuq\"),\n", " (\"05\", \"paseq\"),\n", ")" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "known_accents = {\n", " x[0] for x in irrelevant_accents + secundary_accents + punctuation_accents\n", "}" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "primary_accents = {\n", " \"{:>02}\".format(i) for i in range(100) if \"{:>02}\".format(i) not in known_accents\n", "}\n", "sound_dict = collections.OrderedDict()\n", "sound_dict2 = collections.OrderedDict()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "for (sym, let, glyph) in specials:\n", " if sym in dagesh_lenes:\n", " dagesh_lene_dict[sym[0]] = glyph\n", " else:\n", " sound_dict[sym] = glyph" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "for (sym, let, glyph) in specials2:\n", " sound_dict2[sym] = glyph" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# Patterns\n", "\n", "The ``phono()`` function that we will define (far) below, performs an ordered sequence of transformations.\n", "Most of these are defined as [regular expressions](http://www.regular-expressions.info),\n", "and some parts of those expressions occur over and over again, e.g. 
subpatterns for *vowel* and *consonant*.\n",
 "\n",
 "Here we define the shortcuts that we are going to use in the regular expressions.\n",
 "\n",
 "## Details of the matching process\n",
 "\n",
 "Normally, when a pattern matches a string, the string is consumed: the parts of the pattern that match\n",
 "consume corresponding stretches of the string.\n",
 "However, in many cases a pattern specifies specific contexts in which a match should be found.\n",
 "In those cases we do not want the context parts of the pattern to be responsible for string\n",
 "consumption, because in those parts there could be another relevant match.\n",
 "\n",
 "In regular expressions there is a solution for that: look-ahead and look-behind assertions, and we use them frequently.\n",
 "\n",
 "``(?<=`` *before-pattern* ``)`` *pattern* ``(?=`` *behind-pattern* ``)``\n",
 "\n",
 "A match of this pattern in a string is a portion of the string that matches *pattern*, provided that\n",
 "portion is preceded by *before-pattern* and followed by *behind-pattern*.\n",
 "\n",
 "If there is a match, and new matches must be searched for, the search will start right after *pattern*.\n",
 "\n",
 "Instead of the above *positive* look-ahead and look-behind assertions, there are also *negative* variants:\n",
 "\n",
 "``(?<!`` *before-pattern* ``)`` *pattern* ``(?!`` *behind-pattern* ``)``\n",
 "\n",
 "Quantifiers in patterns come in three kinds: *greedy*, *non-greedy* and *possessive*:\n",
 "\n",
 "| kind | greedy | non-greedy | possessive |\n",
 "| --- | --- | --- | --- |\n",
 "| 0 or more | ``*`` | ``*?`` | ``*+`` |\n",
 "| 1 or more | ``+`` | ``+?`` | ``++`` |\n",
 "| at least *n*, at most *m* | {*n*, *m*} | {*n*, *m*}? | {*n*, *m*}+ |\n",
 "\n",
 "For example, the pattern ``[ab]*b`` matches substrings of ``a``s and ``b``s that end in a ``b``.\n",
 "In order to match the string ``aaaaab``, the ``[ab]*`` part starts with greedily consuming the whole string,\n",
 "but after discovering that the ``b`` part in the pattern should also match something, the ``[ab]*`` part\n",
 "reluctantly gives back one occurrence. That will do the trick.\n",
 "\n",
 "However, ``[ab]*+b`` will not match ``aaaaab``, because the possessive quantifier gives nothing back.\n",
 "\n",
 "Possessive quantifiers are desirable in combination with negative look-ahead assertions.\n",
 "\n",
 "For example, take ``[ab]*+(?!c)``. This will match substrings of ``a``s and ``b``s that are not followed by ``c``.\n",
 "So it matches ``ababab`` but not ``abababc``.\n",
 "However, the non-possessive variant, ``[ab]*(?!c)``, matches both. So how does it match ``abababc``?\n",
 "First, the ``[ab]*`` part matches all ``a``s and ``b``s. Then the look-ahead assertion that ``c`` does not follow\n",
 "is violated. So ``[ab]*`` backtracks one occurrence, a ``b``. At that point the look-ahead assertion finds a ``b``,\n",
 "which is not ``c``, and the match succeeds.\n",
 "\n",
 "Python lacks *possessive* quantifiers in regular expressions, so, again, this makes some expressions below more complicated than they would otherwise be."
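, "\n",
 "Since this backtracking behaviour is easy to get wrong, here is a small, self-contained check of the examples above (not part of the conversion itself), including the classic ``(?=(...))\\1`` emulation of a possessive quantifier in Python's `re`: the look-ahead is atomic, so the captured run is never given back.\n",
 "\n",
 "```python\n",
 "import re\n",
 "\n",
 "# greedy with backtracking: [ab]* gives back one 'b' and the match succeeds\n",
 "print(re.match(r'[ab]*(?!c)', 'abababc').group())  # ababa\n",
 "\n",
 "# possessive emulation: capture the greedy run in a look-ahead, then consume it\n",
 "print(re.match(r'(?=([ab]*))\\1(?!c)', 'abababc'))  # None: nothing is given back\n",
 "print(re.match(r'(?=([ab]*))\\1(?!c)', 'ababab').group())  # ababab\n",
 "```"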
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to test for vowels in look-behind conditions.\n", "Python insists that look-behind conditions match patterns with fixed length.\n", "Vowels have variable length, so we need to take a bit more context.\n", "This extra context is dependent on whether the vowel occurs in front of a consonant or after it\n", "vowel 1 is for before, vowel 2 is for after, both are usable in look-behind conditions\n", "vowel matches purely vowels of variable length, and is not usable in look-behind conditions" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "vowel1 = r\"(?:(?::[ea@])|(?:w\\.)|(?:[i;]j)|(?:ow)|(?:.[%@\\^;aeiIou`]))\"\n", "vowel2 = r\"(?:(?::[ea@])|(?:w\\.)|(?:[i;]j)|(?:ow)|(?:[%@\\^;aeiIou`].))\"\n", "vowel = r\"(?:(?::[ea@])|(?:w\\.)|(?:[i;]j)|(?:ow)|(?:[%@\\^;aeiIou`]))\"" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# lvowel are long vowels only (including compositions)\n", "# svowel are short vowels only, including composite schwas\n", "lvowel1 = r\"(?:(?:w\\.)|(?:[i;]j)|(?:ow)|(?:.[@;o]))\"\n", "svowel = r\"(?:(?::[ea@])|(?:[%@\\^;aeiIou`]))\"" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "gadol = sound_dict[\"@\"]\n", "qatan = sound_dict[\"^\"]\n", "a_like = {\":a\", \"a\"}\n", "o_like = {\":@\", \"o\", \"ow\", \"u\", \"w.\"}\n", "e_like = {\":\", \":e\", \";\", \";j\", \"e\", \"i\", \"ij\"}" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# complex i/w vowel: the composite vowels with waw and yod, after translation\n", "complex_i_vowel = \"\".join(sound_dict[s] for s in {\"ij\", \";j\"})\n", "complex_w_vowel = \"\".join(sound_dict[s] for s in {\"ow\"})" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# consonants\n", "ncons = \"[^>bgdhwzxvjklmnsbgdhwzxvjklmns 1A\n", "JM/ => 1O\n", "JWMM => 2A\n", "JRB 1A\n", "JHWNTN/ => 2A\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "xxx = \"\"\"\n", " 2A\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "# there are unaccented conjugated verb forms that must not be subjected to qamets-qatan transformation\n", "qamets_qatan_verb_x = {\n", " \"verb qal perf 3sf\",\n", " \"verb qal perf 3p-\",\n", " \"verb nif impf 1s-\",\n", " \"verb nif impf 1p-\",\n", " \"verb nif impf 2sf\",\n", " \"verb nif impf 2pm\",\n", " \"verb nif impf 3pm\",\n", " \"verb nif impv 2sf\",\n", " \"verb nif impv 2pm\",\n", "}\n", "qqv_experimental = {\n", " \"verb qal impf 3pm\",\n", "}" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "qamets_qatan_verb_x |= qqv_experimental" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def qamets_qatan_verb_x_repl(match):\n", " return match.group(1) + \"@\"" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "for the use of applying individual corrections:" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "### Actions\n", "Here is the function that carries out rule based qamets qatan detection, without going into\n", "verb paradigms and exceptions. It is the first go at it." 
] }, { "cell_type": "code", "execution_count": 64, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def doplainqamets(word, accentless=False, debug=False, count=False):\n", " dout = []\n", " result = word\n", " if accentless:\n", " result = result.replace(\"!\", \"\").replace(\"/\", \"\")\n", " if count:\n", " pre = result\n", " result = qamets_qatan1.sub(qamets_qatan_repl, result)\n", " if debug:\n", " dout.append((\"qamets_qatan1\", result))\n", " if count and pre != result:\n", " stats[\"qamets_qatan1\"] += 1\n", "\n", " if count:\n", " pre = result\n", " result = qamets_qatan2.sub(qamets_qatan_repl, result)\n", " if debug:\n", " dout.append((\"qamets_qatan2\", result))\n", " if count and pre != result:\n", " stats[\"qamets_qatan2\"] += 1\n", "\n", " if count:\n", " pre = result\n", " result = qamets_qatan3.sub(qamets_qatan_repl, result)\n", " if debug:\n", " dout.append((\"qamets_qatan3\", result))\n", " if count and pre != result:\n", "\n", " stats[\"qamets_qatan3\"] += 1\n", "\n", " if count:\n", " pre = result\n", " result = qamets_qatan4a.sub(qamets_qatan_repl, result)\n", " if debug:\n", " dout.append((\"qamets_qatan4a\", result))\n", " if count and pre != result:\n", " stats[\"qamets_qatan4a\"] += 1\n", "\n", " if count:\n", " pre = result\n", " result = qamets_qatan4b.sub(qamets_qatan_repl, result)\n", " if debug:\n", " dout.append((\"qamets_qatan4b\", result))\n", " if count and pre != result:\n", " stats[\"qamets_qatan4b\"] += 1\n", "\n", " return (result, dout) if debug else result" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Schwa and dagesh\n", "\n", "### Schwa\n", "\n", "The rules for the schwa that I have found are contradictory.\n", "\n", "These rules I have seen (e.g.)\n", "\n", "1. if two consecutive consonants have both a schwa, the second one is mobile;\n", "1. a schwa under a consonant with dagesh forte is mobile\n", "1. a schwa under the last consonant of a word is quiescens\n", "1. a schwa on a consonant that follows a long vowel, is mobile\n", "\n", "But there are examples where rules 1 and 3 apply at the same time.\n", "\n", "And the `qal 3 sg f` forms end with a tav with schwa, often preceded by a consonant with also schwa.\n", "In this case the tav has a dagesh, which by the rules for dagesh cannot be a lene. So it must be a forte.\n", "So this violates rule 2.\n", "\n", "We will cut this matter short, and make any final schwa quiescens.\n", "\n", "As to rule 4, there are cases where the schwa in question is also followed by a final consonant with schwa.\n", "In those cases it seems that the schwa in question is silent." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "# mobile schwa\n", "mobile_schwa1 = re.compile(\n", " r\"\"\"\n", " ( # here is what goes before the schwa in question\n", " (?:(?:\\A|[ &-]).\\.?)| # an initial consonant or\n", " (?:.\\.)| # a consonant with dagesh (which must be forte then) or\n", " (?::.\\.?)| # another schwa and then a consonant\n", " (?: # a long vowel such as the following\n", " (?:\n", " @>?| # qamets possibly with alef as mater lectionis (the remaining qametses are gadol)\n", " ;j?| # tsere, possibly followed by yod\n", " ij| # hireq with yod\n", " o[>w]?| # holam possibly followed by yod\n", " w\\. 
# waw with dagesh\n", " )\n", " {c} # and then a consonant\n", " )\n", " )\n", " :\n", " (?![@ae]) # the schwa may not be composite\n", "\"\"\".format(\n", " c=cons\n", " ),\n", " re.X,\n", ")" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "mobile_schwa2 = re.compile(\n", " r\":(?={b}(?:[^.]|[ &-]|\\Z))\".format(b=bgdkpt)\n", ") # before `BGDKPT` letter without dagesh" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "# second last consonant with schwa when last consonsoant also has schwa\n", "mobile_schwa3 = re.compile(r\"[%:](?={c}\\.?{a}?[%:](?:[ &]|\\Z))\".format(a=acc, c=cons))" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "# all schwas and the end of the word are quiescens, only if the words are not glued together\n", "mobile_schwa4 = re.compile(r\"[%:](?=[ &]|\\Z)\")" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "def mobile_schwa1_repl(match):\n", " return match.group(1) + \"%\"" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "# dagesh\n", "dages_forte_lene = re.compile(\n", " r\"(?<={v1})(-*)({b})\\.(?=[/!]?{v2})\".format(v1=vowel1, v2=vowel, b=bgdkpt)\n", ")\n", "dages_forte = re.compile(\n", " r\"(?<={v1})(-?[h>]*-*)([^h])\\.(?=[/!]?{v2})\".format(v1=vowel1, v2=vowel)\n", ")\n", "dages_lene = re.compile(r\"({b})\\.\".format(b=bgdkpt))" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "def dages_forte_lene_repl(match):\n", " return match.group(1) + (dagesh_lene_dict[match.group(2)] * 2)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "def dages_lene_repl(match):\n", " return dagesh_lene_dict[match.group(1)]" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def dages_forte_repl(match):\n", " return match.group(1) + match.group(2) * 2" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Mater lectionis and final fixes" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "# silent aleph\n", "silent_aleph = re.compile(\"(?<=[^ &-])>(?!(?:[/!]|{v}))\".format(v=vowel))" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "# final mater lectionis\n", "# I assume that heh and alef are only matrices lectionis after a LONG vowel\n", "last_ml = re.compile(r\"(?<={v1})[>h]+(?=[ &-]|\\Z)\".format(v1=lvowel1))\n", "last_ml_jw = re.compile(r\"jw(?=[ &-]|\\Z)\")" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "# mappiq heh\n", "mappiq_heh = re.compile(r\"h\\.\")" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "fixit_i = re.compile(r\"([{v}])\\.\".format(v=complex_i_vowel))\n", "fixit_w = re.compile(r\"([{v}])\\.\".format(v=complex_w_vowel))\n", "fixit = re.compile(r\"(.)\\.\")" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "split_sep = re.compile(\n", " \"^(.*?)([ .&$\\n-]*)$\"\n", ") # to split the result in the phono part and the interword part" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "def fixit_repl(match):\n", " return match.group(1) * 2" ] }, { "cell_type": "code", "execution_count": 80, 
"metadata": {}, "outputs": [], "source": [ "def fixit_i_repl(match):\n", " return match.group(1) + \"j\"" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def fixit_w_repl(match):\n", " return match.group(1) + \"w\"" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "END OF REGULAR EXPRESSIONS AND REPLACEMENT FUNCTIONS" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Qamets corrections\n", "\n", "For some words we need specific corrections.\n", "The rules for qamets qatan are not specific enough.\n", "\n", "### Correction mechanism\n", "\n", "We define a function `apply_corr(wordq, corr)` that can apply a correction instruction to ``wordq``, which is a word in pre-transliterated form, i.e. a word that has underwent transliteration steps ending with qamets interpretation, including applying special verb cases.\n", "\n", "The ``corr`` is a comma-separated list of basic instructions, which have the form\n", "*number* *letter*. It will interpret the *number*-th qamets as a gadol of qatan, depending on whether *letter* = ``ā`` or ``o``.\n", "\n", "### Precomputed list of corrections\n", "\n", "Later on we compile a dictionary ``qamets_corrections`` of pre-computed corrections.\n", "This dictionary is keyed by the pre-transliterated form, and valued by the corresponding correction string. Here we initialize this dictionary.\n", "\n", "The ``phono()`` function that carries out the complete transliteration, looks by default in ``qamets_corrections``, but this can be overridden. These corrections will not be carried out for the special verb cases." ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "qamets_corrections = {} # list of translits that must be corrected" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "apply correction instructions to a word" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def apply_corr(wordq, corr):\n", " if corr == \"\":\n", " return wordq\n", " corrs = corr.split(\",\")\n", " indices = []\n", " for (i, ch) in enumerate(wordq):\n", " if ch == \"^\" or (ch == \"@\" and (i == 0 or wordq[i - 1] != \":\")):\n", " indices.append(i)\n", " resultlist = list(wordq)\n", " for c in corrs:\n", " (pos, kind) = c\n", " pos = int(pos) - 1\n", " repl = \"^\" if kind == \"o\" else \"@\"\n", " if pos >= len(indices):\n", " TF.error(\"Line {}: pos={} out of range {}\".format(ln, pos, indices))\n", " continue\n", " rpos = indices[pos]\n", " resultlist[rpos] = repl\n", " return \"\".join(resultlist)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "### Feature value normalization\n", "\n", "We need concise, normalized values for the lexical features." 
] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "undefs = {\"NA\", \"unknown\", \"n/a\", \"absent\"}" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "png = dict(\n", " NA=\"-\",\n", " unknown=\"-\",\n", " p1=\"1\",\n", " p2=\"2\",\n", " p3=\"3\",\n", " sg=\"s\",\n", " du=\"d\",\n", " pl=\"p\",\n", " m=\"m\",\n", " f=\"f\",\n", " a=\"a\",\n", " c=\"c\",\n", " e=\"e\",\n", ")\n", "png[\"n/a\"] = \"-\"" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "### Lexical info\n", "\n", "We need a label for lexical information such as part of speech, person, number, gender." ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "declensed = {\"subs\", \"nmpr\", \"adjv\", \"prps\", \"prde\", \"prin\"}" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "def get_lex_info(w):\n", " sp = F.sp.v(w)\n", " lex_infos = [sp]\n", " if sp == \"verb\":\n", " lex_infos.extend(\n", " [\n", " F.vs.v(w),\n", " F.vt.v(w),\n", " \"{}{}{}\".format(png[F.ps.v(w)], png[F.nu.v(w)], png[F.gn.v(w)]),\n", " ]\n", " )\n", " elif sp in declensed:\n", " lex_infos.append(\"{}{}\".format(png[F.nu.v(w)], png[F.gn.v(w)]))\n", " lex_info = \" \".join(lex_infos)\n", " if sp == \"verb\" or sp in declensed:\n", " prs = F.g_prs.v(w)\n", " if prs not in undefs:\n", " lex_info += \",{}\".format(prs.lower())\n", " return lex_info" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "def get_decl(lex_info):\n", " if lex_info is None:\n", " lex_info = \"\"\n", " parts = lex_info.split(\",\")\n", " return lex_info if len(parts) == 1 else parts[0]" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def get_prs(lex_info):\n", " if lex_info is None:\n", " lex_info = \"\"\n", " parts = lex_info.split(\",\")\n", " return \"\" if len(parts) == 1 else parts[1]" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# The phono function\n", "\n", "The definition of the function that generates the phonological transliteration.\n", "It is a function with a big definition, so we have broken it in parts.\n", "\n", "## Phono parts" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "interesting_stats = [\n", " \"total\",\n", " \"qamets_verb_suppress_qatan\",\n", " \"qamets_prs_suppress_qatan\",\n", " \"qamets_qatan_corrections\",\n", "]" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "if `suppress_in_verb`, phono will suppress qatan interpretation in certain verb paradigmatic forms\n", "if `suppress_in_prs`, phono will suppress qatan interpretation in pronominal suffixes\n", "if `correct` is 1, phono will apply individual corrections\n", "if `correct` is 0, phono will not apply individual corrections\n", "if `correct` is -1, phono will stop just before applying the qamets qatan corrections and return\n", "the intermediate result" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "def phono_qamets(\n", " ws,\n", " result,\n", " lex_info,\n", " debug,\n", " count,\n", " dout,\n", " suppress_in_verb,\n", " suppress_in_prs,\n", " correct,\n", " corrections,\n", "):\n", " # qamets qatan\n", "\n", " # check whether we are in a verb paradigm that requires suppressing 
qamets => qatan\n", " if count:\n", " pre = result\n", " suppr = True\n", " decl = get_decl(lex_info)\n", "\n", " if suppress_in_verb:\n", " suppr = False\n", " if decl == \"\":\n", " if debug:\n", " dout.append((\"qamets qatan\", \"no special verb form invoked\"))\n", " elif decl not in qamets_qatan_verb_x:\n", " if debug:\n", " dout.append((\"qamets qatan\", \"no special verb form: {}\".format(decl)))\n", " elif \"@\" not in result:\n", " if debug:\n", " dout.append((\"qamets qatan\", \"special verb form: no qamets present\"))\n", " elif \"!\" in result:\n", " if debug:\n", " dout.append(\n", " (\"qamets qatan\", \"special verb form: primary accent present\")\n", " )\n", " suppr = True\n", " else:\n", " suppr = True\n", " if count:\n", " stats[\"qamets_verb_suppress_qatan\"] += 1\n", " else:\n", " if debug:\n", " dout.append((\"qamets qatan\", \"suppression for verb forms is switched off\"))\n", " suppr = False\n", "\n", " if suppr:\n", " if debug:\n", " dout.append(\n", " (\n", " \"qamets qatan\",\n", " \"special verb form: qatan suppressed for {}\".format(decl),\n", " )\n", " )\n", " else:\n", " if debug:\n", " (result, this_dout) = doplainqamets(result, debug=True, count=count)\n", " dout.extend(this_dout)\n", " else:\n", " result = doplainqamets(result, count=count)\n", "\n", " # check whether we have a pronominal suffix that requires suppressing qamets => qatan\n", "\n", " if count:\n", " pre = result\n", " suppr = True\n", " prs = get_prs(lex_info)\n", " if suppress_in_prs:\n", " suppr = False\n", " if prs == \"\":\n", " if debug:\n", " dout.append((\"qamets qatan\", \"no pron suffix indicated\"))\n", " elif \"@\" not in prs:\n", " if debug:\n", " dout.append((\"qamets qatan\", \"pronominal suffix: no qamets present\"))\n", " elif not qamets_qatan_prs.search(result):\n", " if debug:\n", " dout.append(\n", " (\n", " \"qamets qatan\",\n", " \"pron suffix {}: no qamets qatan present\".format(prs),\n", " )\n", " )\n", " else:\n", " suppr = True\n", " if count:\n", " stats[\"qamets_prs_suppress_qatan\"] += 1\n", " else:\n", " if debug:\n", " dout.append((\"qamets qatan\", \"suppression for pron suffix is switched off\"))\n", " suppr = False\n", "\n", " if suppr:\n", " result = qamets_qatan_prs.sub(\"@\", result)\n", " if debug:\n", " dout.append(\n", " (\"qamets qatan\", \"pron suffix {}: qatan suppressed\".format(prs))\n", " )\n", " dout.append((\"qamets qatan prs\", result))\n", "\n", " # now change gadol in qatan in front of other qatan\n", " if count:\n", " pre = result\n", " result = qamets_qatan5.sub(qamets_qatan_repl, result)\n", " if debug:\n", " dout.append((\"qamets_qatan5\", result))\n", " if count and pre != result:\n", " stats[\"qamets_qatan5\"] += 1\n", "\n", " # handle desired corrections\n", " if count:\n", " pre = result\n", " if correct == -1:\n", " return (result, True)\n", " if correct == 1 and decl not in qamets_qatan_verb_x:\n", " if corrections is None:\n", " corrections = qamets_corrections\n", " parts = result.split(\"-\")\n", " hotpart = parts[-1]\n", " wordq = phono(ws[-1], correct=-1, punct=False)\n", " if wordq in corrections:\n", " hotpartn = apply_corr(hotpart, corrections[wordq])\n", " if debug:\n", " dout.append(\n", " (\"qamets qatan\", \"correction: {} => {}\".format(hotpart, hotpartn))\n", " )\n", " parts[-1] = hotpartn\n", " result = \"-\".join(parts)\n", " if debug:\n", " dout.append((\"qamets_qatan_corr\", result))\n", " if count and pre != result:\n", " stats[\"qamets_qatan_corrections\"] += 1\n", "\n", " return (result, False)" ] }, { 
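"cell_type": "markdown", "metadata": {}, "source": [ "The next part applies the schwa, dagesh and mater lectionis patterns. It uses the same bookkeeping idiom as the function above: each rewriting step logs its result to `dout` (when `debug`) and bumps a counter in `stats` (when `count` and the step actually changed the string). A minimal sketch of that recurring idiom (``apply_step`` is a hypothetical helper, not a function of this notebook):\n",
 "\n",
 "```python\n",
 "def apply_step(name, pattern, repl, result, dout, debug=False, count=False):\n",
 "    pre = result\n",
 "    result = pattern.sub(repl, result)\n",
 "    if debug:\n",
 "        dout.append((name, result))\n",
 "    if count and pre != result:\n",
 "        stats[name] += 1\n",
 "    return result\n",
 "```" ] }, {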
"cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "def phono_patterns(result, debug, count, dout):\n", "\n", " # mobile schwa\n", " if count:\n", " pre = result\n", " result = mobile_schwa1.sub(mobile_schwa1_repl, result)\n", " if debug:\n", " dout.append((\"mobile_schwa1\", result))\n", " if count and pre != result:\n", " stats[\"mobile_schwa1\"] += 1\n", "\n", " if count:\n", " pre = result\n", " result = mobile_schwa2.sub(\"%\", result)\n", " if debug:\n", " dout.append((\"mobile_schwa2\", result))\n", " if count and pre != result:\n", " stats[\"mobile_schwa2\"] += 1\n", "\n", " if count:\n", " pre = result\n", " result = mobile_schwa3.sub(\"\", result)\n", " if debug:\n", " dout.append((\"mobile_schwa3\", result))\n", " if count and pre != result:\n", " stats[\"mobile_schwa3\"] += 1\n", "\n", " if count:\n", " pre = result\n", " result = mobile_schwa4.sub(\"\", result)\n", " if debug:\n", " dout.append((\"mobile_schwa4\", result))\n", " if count and pre != result:\n", " stats[\"mobile_schwa4\"] += 1\n", "\n", " # dagesh\n", " if count:\n", " pre = result\n", " result = dages_forte_lene.sub(dages_forte_lene_repl, result)\n", " if debug:\n", " dout.append((\"dagesh_forte_lene\", result))\n", " if count and pre != result:\n", " stats[\"dagesh_forte_lene\"] += 1\n", "\n", " if count:\n", " pre = result\n", " result = result.replace(\"ij.\", \"Ijj\")\n", " result = dages_forte.sub(dages_forte_repl, result)\n", " if debug:\n", " dout.append((\"dagesh_forte\", result))\n", " if count and pre != result:\n", " stats[\"dagesh_forte\"] += 1\n", "\n", " if count:\n", " pre = result\n", " result = dages_lene.sub(dages_lene_repl, result)\n", " if debug:\n", " dout.append((\"dagesh_lene\", result))\n", " if count and pre != result:\n", " stats[\"dagesh_lene\"] += 1\n", "\n", " # silent aleph (but not in tetra)\n", " if count:\n", " pre = result\n", " if \"[\" not in result:\n", " result = silent_aleph.sub(\"\", result)\n", " if debug:\n", " dout.append((\"silent_aleph\", result))\n", " if count and pre != result:\n", " stats[\"silent_aleph\"] += 1\n", "\n", " # final mater lectionis (but not in tetra)\n", " if count:\n", " pre = result\n", " if \"[\" not in result:\n", " result = last_ml_jw.sub(\"ʸw\", result)\n", " result = last_ml.sub(\"\", result)\n", " if debug:\n", " dout.append((\"last_ml\", result))\n", " if count and pre != result:\n", " stats[\"last_ml\"] += 1\n", "\n", " # mappiq heh\n", " if count:\n", " pre = result\n", " result = mappiq_heh.sub(\"h\", result)\n", " if debug:\n", " dout.append((\"mappiq_heh\", result))\n", " if count and pre != result:\n", " stats[\"mappiq_heh\"] += 1\n", "\n", " return result" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def phono_symbols(ws, result, debug, count, dout):\n", "\n", " # split the result in parts corresponding with the word nodes of the original\n", " resultparts = result.split(\"-\")\n", " results = []\n", " for (i, w) in enumerate(ws):\n", " resultp = resultparts[i]\n", " result = resultp\n", " # masora\n", " if F.qere.v(w) is not None:\n", " result = \"*\" + result\n", "\n", " for (sym, repl) in sound_dict.items():\n", " result = result.replace(sym, repl)\n", " if debug:\n", " dout.append((\"symbols\", result))\n", "\n", " # fix left over dagesh and mappiq\n", " if count:\n", " pre = result\n", " result = fixit_i.sub(fixit_i_repl, result)\n", " if debug:\n", " dout.append((\"fixit_i\", result))\n", " if count and pre != 
result:\n", "            stats[\"fixit_i\"] += 1\n", "\n", "        if count:\n", "            pre = result\n", "        result = fixit_w.sub(fixit_w_repl, result)\n", "        if debug:\n", "            dout.append((\"fixit_w\", result))\n", "        if count and pre != result:\n", "            stats[\"fixit_w\"] += 1\n", "\n", "        if count:\n", "            pre = result\n", "        result = fixit.sub(fixit_repl, result)\n", "        if count and pre != result:\n", "            stats[\"fixit\"] += 1\n", "        if debug:\n", "            dout.append((\"fixit\", result))\n", "\n", "        if count:\n", "            pre = result\n", "        for (sym, repl) in sound_dict2.items():\n", "            result = result.replace(sym, repl)\n", "        if debug:\n", "            dout.append((\"punct\", result))\n", "        if count and pre != result:\n", "            stats[\"punct\"] += 1\n", "\n", "        # zero width word boundary\n", "        if count:\n", "            pre = result\n", "        result = multiple_space.sub(\" \", result)\n", "        result = result.replace(\"[ \", \"[\").replace(\" ]\", \"]\")  # tetra\n", "        if debug:\n", "            dout.append((\"cleanup\", result))\n", "        if count and pre != result:\n", "            stats[\"cleanup\"] += 1\n", "        results.append(result)\n", "\n", "    return results" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Phono whole\n", "Here the rule fabrics are woven together and the exceptions are invoked." ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def phono(\n", "    ws,\n", "    suppress_in_verb=True,\n", "    suppress_in_prs=True,\n", "    correct=1,\n", "    corrections=None,\n", "    inparts=False,\n", "    debug=False,\n", "    count=False,\n", "    punct=True,\n", "):\n", "    if type(ws) is int:\n", "        ws = [ws]\n", "    if count:\n", "        stats[\"total\"] += 1\n", "    dout = []\n", "    # collect information\n", "    orig = \"\".join(get_orig(w, punct=True) for w in ws)\n", "    lex_info = get_lex_info(ws[-1])\n", "    # strip punctuation at the end, if needed\n", "    if not punct:\n", "        orig = punctuation.sub(\"\", orig)\n", "    # account for ketiv-qere if in debug mode\n", "    if debug:\n", "        for w in ws:\n", "            if F.qere.v(w) is not None:\n", "                dout.append(\n", "                    (\n", "                        \"ketiv-qere\",\n", "                        \"{} => {}\".format(\n", "                            F.g_word.v(w), F.qere.v(w) + F.qere_trailer.v(w)\n", "                        ),\n", "                    )\n", "                )\n", "    # accents\n", "    if debug:\n", "        (result, dout) = doaccents(orig, debug=True, count=count)\n", "    else:\n", "        result = doaccents(orig, count=count)\n", "    # qamets\n", "    (result, deliver) = phono_qamets(\n", "        ws,\n", "        result,\n", "        lex_info,\n", "        debug,\n", "        count,\n", "        dout,\n", "        suppress_in_verb,\n", "        suppress_in_prs,\n", "        correct,\n", "        corrections,\n", "    )\n", "    if deliver:\n", "        return (result, dout) if debug else result\n", "    # patterns\n", "    result = phono_patterns(result, debug, count, dout)\n", "    # symbols\n", "    results = phono_symbols(ws, result, debug, count, dout)\n", "    result = \"\".join(results) if not inparts else results\n", "    # deliver\n", "    return (result, dout) if debug else result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Skeleton analysis\n", "\n", "We have to do more work for the qamets. Sometimes a word form on its own is not enough to determine whether a qamets is gadol or qatan. In those cases, we analyse all occurrences of the same lexeme, and for each syllable position we measure whether an A-like vowel or an O-like vowel tends to occur in that syllable.\n", "\n", "In order to do that, we need to compute a *vowel skeleton* for each word."
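, "\n", "\n", "To make the corpus-voting idea concrete before the real machinery: the sketch below counts A-ish versus O-ish evidence per syllable slot over a handful of toy skeletons (not real BHSA data) and lets the majority decide, mirroring the counting in ``compile_occs()`` further down. The symbols chosen for gadol and qatan are assumptions, picked to match the output shown later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import collections\n", "\n", "# assumed renderings of the two readings (chosen to match the overrides printed below)\n", "GADOL, QATAN = \"ā\", \"o\"\n", "\n", "\n", "def vote(skeletons):\n", "    # per syllable position, tally A-ish against O-ish evidence and return\n", "    # the majority reading for every position with a clear winner\n", "    counts = collections.defaultdict(collections.Counter)\n", "    for skel in skeletons:\n", "        for (i, c) in enumerate(skel):\n", "            counts[i][c] += 1\n", "    decision = {}\n", "    for (i, cnt) in counts.items():\n", "        a_ish = cnt[GADOL] + cnt[\"A\"]\n", "        o_ish = cnt[QATAN] + cnt[\"O\"]\n", "        if a_ish != o_ish:\n", "            decision[i] = GADOL if a_ish > o_ish else QATAN\n", "    return decision\n", "\n", "\n", "# three toy occurrences of one lexeme: slot 0 leans A-ish, slot 1 leans O-ish\n", "print(vote([\"āA\", \"oO\", \"AO\"]))"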
] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Stripping paradigmatic material\n", "\n", "A word may have extra syllables, due to inflections, such as plurals, feminine forms, or suffixes. Let us call this the *paradigmatic material* of a word.\n", "\n", "Now, we strip from the initial vowel skeleton a number of trailing vowels that corresponds\n", "to the number of consonants found in the paradigmatic material.\n", "This is rather crude, but it will do." ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "# we need the number of letters in a defined value of a morpho feature\n", "def len_suffix(v):\n", "    if v is None:\n", "        return 0\n", "    if v in undefs:\n", "        return 0\n", "    return len(v.replace(\"=\", \"\").replace(\"W\", \"\").replace(\"J\", \"\"))" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "# we need a function that returns 1 for plural/dual subs/adj and for fem adj\n", "def len_ending(sp, n, g):\n", "    if sp == \"subs\":\n", "        return 1 if n in {\"pl\", \"du\"} else 0\n", "    if sp == \"adjv\":\n", "        return 1 if n in {\"pl\", \"du\"} or g == \"f\" else 0\n", "    return 0" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "# return the number of consonants in the suffixes\n", "def len_morpho(w):\n", "    return max(\n", "        (\n", "            len_suffix(F.prs.v(w)) + len_suffix(F.uvf.v(w)),\n", "            len_ending(F.sp.v(w), F.nu.v(w), F.gn.v(w)),\n", "        )\n", "    )" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Skeleton patterns\n", "\n", "Next, we reduce the vowel skeleton to a skeleton pattern. We are not interested in all vowels, only in whether the vowel is a qamets (gadol or qatan), A-like, O-like, or other (which we dub E-like)." ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "# the qamets gadol/qatan skeleton\n", "qamets_qatan_skel = re.compile(\"([^@^])\")" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "# the vowel skeleton where the qamets gadol/qatan are preserved as @ and ^\n", "# another o-like vowel becomes O (holam, qamets chatuf) (no waws nor yods)\n", "# another a-like vowel becomes A (patah, patah chatuf) (no alefs)\n", "silent_alef_start = re.compile(r\"([ &-]|\\A)>([!/]?(?:[^!/.:;@^aeiou]|\\Z))\")" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [], "source": [ "def silent_alef_start_repl(match):\n", "    return match.group(1) + \"E\" + match.group(2)" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [], "source": [ "qamets_qatan_fullskel = re.compile(\n", "    r\"\"\"\n", "    (\n", "        E                              # replacement of silent initial alef without vowels\n", "        | (?::[@ae]?)                  # a (composite) schwa\n", "        | (?:[;i]j) | (?:ow) | (?:w.)  # a composite vowel\n", "        | [@a;eiou^]                   # a vowel point\n", "        | .                            
# anything else\n", "    )\n", "\"\"\",\n", "    re.X,\n", ")" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [], "source": [ "def qamets_qatan_fullskel_repl(match):\n", "    found = match.group(1)\n", "    if found == \"E\":\n", "        return \"E\"\n", "    if found == \"@\":\n", "        return gadol\n", "    if found == \"^\":\n", "        return qatan\n", "    if found in a_like:\n", "        return \"A\"\n", "    if found in o_like:\n", "        return \"O\"\n", "    if found in e_like:\n", "        return \"E\"\n", "    return \"\"" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def get_full_skel(w, debug=False):\n", "    wordq = phono(w, correct=-1, punct=False)\n", "    wordqr = silent_alef_start.sub(silent_alef_start_repl, wordq)\n", "    fullskel = qamets_qatan_fullskel.sub(qamets_qatan_fullskel_repl, wordqr)\n", "    ending_length = len_morpho(w)\n", "    relevant_part = len(fullskel) - ending_length\n", "    if debug:\n", "        orig = get_orig(w, punct=False, tetra=False)\n", "        TF.info(\n", "            \"{}: {} => {} => {} : {} minus {} = {}\".format(\n", "                w,\n", "                orig,\n", "                wordq,\n", "                wordqr,\n", "                fullskel,\n", "                ending_length,\n", "                fullskel[0:relevant_part],\n", "            )\n", "        )\n", "\n", "    return fullskel[0:relevant_part]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Qamets gadol qatan: sophisticated\n", "\n", "A lot of work is needed to get the qamets gadol-qatan right.\n", "This involves looking at accents, verb paradigms and special cases among the non-verbs." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Qamets gadol qatan: non-verbs\n", "\n", "Sometimes a qamets is gadol or qatan for lexical reasons, i.e. it cannot be derived by rules from the word occurrence itself; other occurrences have to be invoked.\n", "\n", "### All candidates" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 11s \tLooking for non-verb qamets\n" ] } ], "source": [ "# find lexemes which have an occurrence with a qamets (except verbs)\n", "utils.caption(0, \"\\tLooking for non-verb qamets\")\n", "qq_words = set()\n", "qq_lex = collections.defaultdict(lambda: [])" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 13s \t4056 lexemes and 13451 unique occurrences\n" ] } ], "source": [ "for w in F.otype.s(\"word\"):\n", "    ln = F.languageISO.v(w)\n", "    if ln != \"hbo\":\n", "        continue\n", "    sp = F.sp.v(w)\n", "    if sp == \"verb\":\n", "        continue\n", "    orig = get_orig(w, punct=False, tetra=False)\n", "    if \"@\" not in orig:\n", "        continue  # no qamets in word\n", "    word = doaccents(orig)\n", "    lex = F.lex.v(w)\n", "    if word in qq_words:\n", "        continue\n", "    qq_words.add(word)\n", "    qq_lex[lex].append(w)\n", "utils.caption(\n", "    0, \"\\t{} lexemes and {} unique occurrences\".format(len(qq_lex), len(qq_words))\n", ")" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "### Filtering interesting candidates" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 13s \tFiltering lexemes with varied occurrences\n", "| 13s \t161 interesting lexemes with 1704 unique occurrences\n" ] } ], "source": [ "utils.caption(0, \"\\tFiltering lexemes with varied occurrences\")\n", "qq_varied = collections.defaultdict(lambda: [])\n", "nocc = 0\n", "for lex in 
qq_lex:\n", " ws = qq_lex[lex]\n", " if len(ws) == 1:\n", " continue\n", " occs = []\n", " skel_set = set()\n", " has_qatan = False\n", " has_gadol = False\n", " for w in ws:\n", " wordq = phono(w, correct=-1, punct=False)\n", " skel = (\n", " qamets_qatan_skel.sub(\"\", wordq.replace(\":@\", \"\"))\n", " .replace(\"@\", gadol)\n", " .replace(\"^\", qatan)\n", " )\n", " if gadol in skel:\n", " has_gadol = True\n", " if qatan in skel:\n", " has_qatan = True\n", " skel_set.add(skel)\n", " occs.append((skel, w))\n", " if len(skel_set) > 1 and has_qatan and has_gadol:\n", " for (skel, w) in occs:\n", " fullskel = get_full_skel(w)\n", " qq_varied[lex].append((skel, fullskel, w))\n", " nocc += 1\n", "utils.caption(\n", " 0,\n", " \"\\t{} interesting lexemes with {} unique occurrences\".format(len(qq_varied), nocc),\n", ")" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "### Guess the qamets" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [], "source": [ "qamets_qatan_xc = dict(\n", " (x[0], x[1]) for x in (y.split(\" => \") for y in qamets_qatan_x.strip().split(\"\\n\"))\n", ")\n", "qamets_qatan_xcompiled = collections.defaultdict(lambda: {})\n", "for (lex, corrstr) in qamets_qatan_xc.items():\n", " corrs = corrstr.split(\",\")\n", " for corr in corrs:\n", " (pos, ins) = corr\n", " pos = int(pos) - 1\n", " qamets_qatan_xcompiled[lex][pos] = ins" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [], "source": [ "def compile_occs(lex, occs):\n", " vowel_counts = collections.defaultdict(lambda: collections.Counter())\n", " for (skel, fullskel, w) in occs:\n", " for (i, c) in enumerate(fullskel):\n", " vowel_counts[i][c] += 1\n", " occs_compiled = {}\n", " for i in sorted(vowel_counts):\n", " vowel_count = vowel_counts[i]\n", " a_ish = vowel_count.get(gadol, 0) + vowel_count.get(\"A\", 0)\n", " o_ish = vowel_count.get(qatan, 0) + vowel_count.get(\"O\", 0)\n", " if a_ish != o_ish:\n", " occs_compiled[i] = gadol if a_ish > o_ish else qatan\n", " if lex in qamets_qatan_xcompiled:\n", " override = qamets_qatan_xcompiled[lex]\n", " for i in override:\n", " ins = override[i]\n", " old_ins = occs_compiled.get(i, \"\")\n", " new_ins = gadol if ins == \"A\" else qatan\n", " if old_ins == new_ins:\n", " TF.info(\n", " \"\\t{}: No override needed for syllable {} which is {}\".format(\n", " lex,\n", " i + 1,\n", " old_ins,\n", " ),\n", " tm=False,\n", " )\n", " else:\n", " TF.info(\n", " \"\\t{}: Override for syllable {}: {} becomes {}\".format(\n", " lex,\n", " i + 1,\n", " old_ins,\n", " new_ins,\n", " ),\n", " tm=False,\n", " )\n", " occs_compiled[i] = new_ins\n", " return occs_compiled" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "def guess_qq(occ, occs_compiled, debug=False):\n", " (skel, fullskel, w) = occ\n", " guess = \"\"\n", " for (i, c) in enumerate(fullskel):\n", " guess += occs_compiled.get(i, c) if c == gadol or c == qatan else c\n", " if debug:\n", " TF.info(\"{}\".format(w), tm=False)\n", " return guess" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def get_corr(fullskel, guess, debug=False):\n", " n = 0\n", " corr = []\n", " for (i, fc) in enumerate(fullskel):\n", " if fc != qatan and fc != gadol:\n", " continue\n", " n += 1\n", " gc = guess[i]\n", " if fc == gc:\n", " continue\n", " corr.append(\"{}{}\".format(n, gc))\n", " if debug:\n", " TF.info(\"{} guess {} corr 
{}\".format(fullskel, guess, corr), tm=False)\n", " return \",\".join(corr)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "### Carrying out the guess work" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 13s \tGuessing between gadol and qatan\n", "\tJM/: Override for syllable 1: ā becomes o\n", "\tBJT/: Override for syllable 1: o becomes ā\n", "\tJWMM: Override for syllable 2: becomes ā\n", "\tJHWNTN/: Override for syllable 2: becomes ā\n", "\tJRB {}): first {} and then {}\".format(\n", " lex,\n", " wordq,\n", " skel,\n", " fullskel,\n", " guess,\n", " old_corr,\n", " corr,\n", " )\n", " )\n", " nconflicts += 1\n", " qamets_corrections[wordq] = corr\n", "\n", " if this_ndiff_occs:\n", " ndiff_lexs += 1\n", " ndiff_occs += this_ndiff_occs\n", " qq_varied_remaining.add(lex)\n", "utils.caption(\n", " 0, \"\\t{} lexemes with modified occurrences ({})\".format(ndiff_lexs, ndiff_occs)\n", ")\n", "utils.caption(0, \"\\t{} patterns with conflicts\".format(nconflicts))" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# Generate phonological data" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "def stats_prog():\n", " return \" \".join(str(stats.get(stat, 0)) for stat in interesting_stats)" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 13s Generating data in two ways ... .\n", "..............................................................................................\n" ] } ], "source": [ "utils.caption(4, \"Generating data in two ways ... 
\")" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [], "source": [ "phono_file = []\n", "word_file = []" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 14s \t 1000 verses 13316 62 0 21\n", "| 16s \t 2000 verses 27407 123 2 79\n", "| 17s \t 3000 verses 40963 174 5 125\n", "| 18s \t 4000 verses 54143 242 8 143\n", "| 19s \t 5000 verses 67151 308 13 171\n", "| 20s \t 6000 verses 82448 394 15 196\n", "| 22s \t 7000 verses 97551 457 17 254\n", "| 23s \t 8000 verses 113748 529 18 287\n", "| 24s \t 9000 verses 129602 573 20 327\n", "| 26s \t10000 verses 146217 624 20 438\n", "| 27s \t11000 verses 159809 749 20 487\n", "| 28s \t12000 verses 174192 891 24 524\n", "| 30s \t13000 verses 190555 1018 28 576\n", "| 31s \t14000 verses 205104 1168 32 622\n", "| 32s \t15000 verses 218610 1290 33 728\n", "| 33s \t16000 verses 227944 1336 39 777\n", "| 34s \t17000 verses 235635 1379 48 827\n", "| 35s \t18000 verses 243258 1396 51 866\n", "| 35s \t19000 verses 250709 1429 59 906\n", "| 36s \t20000 verses 260118 1470 60 960\n", "| 38s \t21000 verses 275083 1533 63 979\n", "| 39s \t22000 verses 286442 1590 65 1007\n", "| 40s \t23000 verses 301302 1645 66 1075\n" ] } ], "source": [ "stats = collections.Counter()\n", "nv = 0\n", "nchunk = 1000\n", "nvc = 0\n", "for v in F.otype.s(\"verse\"):\n", " nv += 1\n", " nvc += 1\n", " if nvc == nchunk:\n", " utils.caption(0, \"\\t{:>5} verses {}\".format(nv, stats_prog()))\n", " nvc = 0\n", "\n", " words = partition_w(L.d(v, \"word\"))\n", " phonos = []\n", "\n", " for ws in words:\n", " lws = len(ws)\n", " phono_w = phono(ws, inparts=True, count=True)\n", " phono_file.append(\"\".join(phono_w))\n", " for (i, w) in enumerate(ws):\n", " (real_phono, sep) = phono_sep.fullmatch(phono_w[i]).groups()\n", " word_file.append((w, real_phono, sep))\n", "\n", " if not phono_file[-1].endswith(\". 
\"):\n", " word_file.append((None, \"\", \"+\"))" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 40s \t23213 verses done 304800 1650 66 1081\n", "| 40s \t 270191 accents\n", "| 40s \t 9006 cleanup\n", "| 40s \t 45235 dagesh_forte\n", "| 40s \t 21511 dagesh_forte_lene\n", "| 40s \t 59612 dagesh_lene\n", "| 40s \t 16322 default_accent\n", "| 40s \t 968 fixit\n", "| 40s \t 2658 furtive_patah\n", "| 40s \t 28195 last_ml\n", "| 40s \t 2201 mappiq_heh\n", "| 40s \t 93898 mobile_schwa1\n", "| 40s \t 2255 mobile_schwa2\n", "| 40s \t 179 mobile_schwa3\n", "| 40s \t 7702 mobile_schwa4\n", "| 40s \t 25498 punct\n", "| 40s \t 25498 punctuation\n", "| 40s \t 66 qamets_prs_suppress_qatan\n", "| 40s \t 5257 qamets_qatan1\n", "| 40s \t 243 qamets_qatan2\n", "| 40s \t 1791 qamets_qatan3\n", "| 40s \t 28 qamets_qatan4a\n", "| 40s \t 256 qamets_qatan4b\n", "| 40s \t 209 qamets_qatan5\n", "| 40s \t 1081 qamets_qatan_corrections\n", "| 40s \t 1650 qamets_verb_suppress_qatan\n", "| 40s \t 12 rafe\n", "| 40s \t 21098 silent_aleph\n", "| 40s \t 304800 total\n", "| 40s \t 304796 trim\n" ] } ], "source": [ "utils.caption(0, \"\\t{:>5} verses done {}\".format(nv, stats_prog()))\n", "for stat in sorted(stats):\n", " amount = stats[stat]\n", " utils.caption(\n", " 0,\n", " \"\\t{:<1} {:>6} {}\".format(\n", " \"#\" if amount == 0 else \"\",\n", " amount,\n", " stat,\n", " ),\n", " )" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Consistency check\n", "\n", "We take the just generated `phono` and `wordph` files.\n", "From the `phono` file we strip the passage indicators, and from the `wordph` we strip the node numbers.\n", "\n", "They should be consistent." ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 40s 304800 items in phono\n", " 40s Reading word\n", "| 41s \t23213 lines\n" ] } ], "source": [ "utils.caption(0, \"{} items in phono\".format(len(phono_file)))\n", "word_test = []\n", "TF.info(\"Reading word\")\n", "i = 0\n", "for (w, mat, sep) in word_file:\n", " rsep = \"\" if sep == \"+\" else sep\n", " word_test.append(mat + rsep)\n", " if \". \" in sep or \"+\" in sep:\n", " i += 1\n", "utils.caption(0, \"\\t{} lines\".format(i))" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 41s \tOK: phono text and word info are CONSISTENT\n" ] } ], "source": [ "phono_text = \"\".join(phono_file)\n", "word_text = \"\".join(word_test)\n", "if phono_text != word_text:\n", " utils.caption(0, \"\\tERROR: phono text and word info are NOT consistent\")\n", "else:\n", " utils.caption(0, \"\\tOK: phono text and word info are CONSISTENT\")" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# Generating phono module for Text-Fabric\n", "\n", "We generate the features `phono` and `phono_trailer`.\n", "They are defined for words.\n", "\n", "We also generate a config feature `otext@phono`, which will be picked up by Text-Fabric automatically.\n", "In it we define the phonetic *format*, so that Text-Fabric has can output text in phonetic representation." 
] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [], "source": [ "genericMetaPath = f\"{thisRepo}/yaml/generic.yaml\"\n", "phonoMetaPath = f\"{thisRepo}/yaml/phono.yaml\"\n", "\n", "with open(genericMetaPath) as fh:\n", " genericMeta = yaml.load(fh, Loader=yaml.FullLoader)\n", " genericMeta[\"version\"] = VERSION\n", "with open(phonoMetaPath) as fh:\n", " phonoMeta = formatMeta(yaml.load(fh, Loader=yaml.FullLoader))\n", " \n", "metaData = {\"\": genericMeta, **phonoMeta}" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 3m 09s Writing TF phono features .\n", "..............................................................................................\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "utils.caption(4, \"Writing TF phono features\")\n", "nodeFeatures = dict(\n", " phono=dict(((ln[0], ln[1]) for ln in word_file if ln[0] is not None)),\n", " phono_trailer=dict(((ln[0], ln[2]) for ln in word_file if ln[0] is not None)),\n", ")\n", "edgeFeatures = {}\n", "metaData[\"otext@phono\"] = {\n", " \"about\": \"Provides phonetic transcriptions to Hebrew Words\",\n", " \"see\": \"https://github.com/ETCBC/phono\",\n", " \"fmt:text-phono-full\": \"{phono}{phono_trailer}\",\n", "}\n", "metaData[\"phono\"][\"valueType\"] = \"str\"\n", "metaData[\"phono_trailer\"][\"valueType\"] = \"str\"\n", "\n", "TF = Fabric(locations=thisTempTf, silent=True)\n", "TF.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures, metaData=metaData)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# Diffs\n", "\n", "Check differences with previous versions." ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 6m 08s Check differences with previous version .\n", "..............................................................................................\n", "| 6m 08s \tno features to add\n", "| 6m 08s \tno features to delete\n", "| 6m 08s \t2 features in common\n", "| 6m 08s phono ... no changes\n", "| 6m 08s phono_trailer ... no changes\n", "| 6m 08s Done\n" ] } ], "source": [ "utils.checkDiffs(thisTempTf, thisTf, only=set(nodeFeatures))" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# Deliver\n", "\n", "Copy the new TF features from the temporary location where they have been created to their final destination." ] }, { "cell_type": "code", "execution_count": 126, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 
6m 11s Deliver data set to /Users/werk/github/etcbc/phono/tf/2021 .\n", "..............................................................................................\n" ] } ], "source": [ "utils.deliverDataset(thisTempTf, thisTf)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# Compile TF" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 6m 14s Load and compile the new TF features .\n", "..............................................................................................\n" ] } ], "source": [ "utils.caption(4, \"Load and compile the new TF features\")" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 9.1.7\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "117 features found and 0 ignored\n", " 0.00s loading features ...\n", " | 0.00s Dataset without structure sections in otext:no structure functions in the T-API\n", " | 1.01s T phono from ~/github/etcbc/phono/tf/2021\n", " | 0.60s T phono_trailer from ~/github/etcbc/phono/tf/2021\n", " 15s All features loaded/computed - for details use TF.isLoaded()\n" ] }, { "data": { "text/plain": [ "[('Computed',\n", " 'computed-data',\n", " ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),\n", " ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),\n", " ('Fabric', 'loading', ('TF',)),\n", " ('Locality', 'locality', ('L Locality',)),\n", " ('Nodes', 'navigating-nodes', ('N Nodes',)),\n", " ('Features',\n", " 'node-features',\n", " ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),\n", " ('Search', 'search', ('S Search',)),\n", " ('Text', 'text', ('T Text',))]" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TF = Fabric(locations=[coreTf, thisTf], modules=[\"\"])\n", "api = TF.load(\" \".join(nodeFeatures))\n", "api.makeAvailableIn(globals())" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 6m 33s Basic tests .\n", "..............................................................................................\n" ] } ], "source": [ "utils.caption(4, \"Basic tests\")" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 6m 36s First verses in phonetic transcription .\n", "..............................................................................................\n", "Genesis 1:1\n", "bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . \n", "Genesis 1:2\n", "wᵊhāʔˈāreṣ hāyᵊṯˌā ṯˈōhû wāvˈōhû wᵊḥˌōšeḵ ʕal-pᵊnˈê ṯᵊhˈôm wᵊrˈûₐḥ ʔᵉlōhˈîm mᵊraḥˌefeṯ ʕal-pᵊnˌê hammˈāyim . \n", "Genesis 1:3\n", "wayyˌōmer ʔᵉlōhˌîm yᵊhˈî ʔˈôr wˈayᵊhî-ʔˈôr . \n", "Genesis 1:4\n", "wayyˈar ʔᵉlōhˈîm ʔeṯ-hāʔˌôr kî-ṭˈôv wayyavdˈēl ʔᵉlōhˈîm bˌên hāʔˌôr ûvˌên haḥˈōšeḵ . \n", "Genesis 1:5\n", "wayyiqrˌā ʔᵉlōhˈîm lāʔôr yˈôm wᵊlaḥˌōšeḵ qˈārā lˈāyᵊlā wˈayᵊhî-ʕˌerev wˈayᵊhî-vˌōqer yˌôm ʔeḥˈāḏ . 
f \n", "Genesis 1:6\n", "wayyˈōmer ʔᵉlōhˈîm yᵊhˌî rāqˌîₐʕ bᵊṯˈôḵ hammˈāyim wiyhˈî mavdˈîl bˌên mˌayim lāmˈāyim . \n", "Genesis 1:7\n", "wayyˈaʕaś ʔᵉlōhîm ʔeṯ-hārāqîˌₐʕ wayyavdˈēl bˈên hammˈayim ʔᵃšˌer mittˈaḥaṯ lārāqˈîₐʕ ûvˈên hammˈayim ʔᵃšˌer mēʕˈal lārāqˈîₐʕ wˈayᵊhî-ḵˈēn . \n", "Genesis 1:8\n", "wayyiqrˈā ʔᵉlōhˈîm lˈārāqˌîₐʕ šāmˈāyim wˈayᵊhî-ʕˌerev wˈayᵊhî-vˌōqer yˌôm šēnˈî . f \n", "Genesis 1:9\n", "wayyˈōmer ʔᵉlōhˈîm yiqqāwˌû hammˈayim mittˈaḥaṯ haššāmˈayim ʔel-māqˈôm ʔeḥˈāḏ wᵊṯērāʔˌeh hayyabbāšˈā wˈayᵊhî-ḵˈēn . \n", "Genesis 1:10\n", "wayyiqrˌā ʔᵉlōhˈîm layyabbāšˌā ʔˈereṣ ûlᵊmiqwˌē hammˌayim qārˈā yammˈîm wayyˌar ʔᵉlōhˌîm kî-ṭˈôv . \n" ] } ], "source": [ "utils.caption(4, \"First verses in phonetic transcription\")\n", "for v in F.otype.s(\"verse\")[0:10]:\n", " utils.caption(0, \"{} {}:{}\".format(*T.sectionFromNode(v)), continuation=True)\n", " utils.caption(0, T.text(L.d(v, \"word\"), fmt=\"text-phono-full\"), continuation=True)" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 6m 41s First verse in all formats .\n", "..............................................................................................\n", "lex-orig-full\n", "\tבְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ \n", "lex-orig-plain\n", "\tב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ \n", "lex-trans-full\n", "\tB.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY \n", "lex-trans-plain\n", "\tB R>CJT BR> >LHJM >T H CMJM W >T H >RY \n", "text-orig-full\n", "\tבְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n", "text-orig-full-ketiv\n", "\tבְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ \n", "text-orig-plain\n", "\tבראשׁית ברא אלהים את השׁמים ואת הארץ׃ \n", "text-phono-full\n", "\tbᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ . \n", "text-trans-full\n", "\tB.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n", "text-trans-full-ketiv\n", "\tB.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n", "text-trans-plain\n", "\tBR>CJT BR> >LHJM >T HCMJM W>T H>RY00 \n" ] } ], "source": [ "utils.caption(4, \"First verse in all formats\")\n", "for fmt in T.formats:\n", " utils.caption(0, \"{}\".format(fmt), continuation=True)\n", " utils.caption(0, \"\\t{}\".format(T.text(range(1, 12), fmt=fmt)), continuation=True)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# End of pipeline\n", "\n", "If this notebook is run with the purpose of generating data, this is the end then.\n", "\n", "After this tests and examples are run." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "if SCRIPT:\n", " stop(good=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Testing\n", "\n", "The function below reads a text file with tests.\n", "\n", "A test is a tab separated line with as fields:\n", "\n", " passage ETCBC-original phono-transcription expected-result bol-reference comments\n", "\n", "The testing routine executes all tests, checks the results, produces on-screen output, debug output in file, and pretty output in a HTML file." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "Load the features needed for testing." 
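, "\n", "\n", "For orientation, the cell below first shows how one such tab-separated test line decomposes into fields; the line is made up, but it is split exactly as ``runtests()`` further down splits its input." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a made-up test line in the tab-separated format described above\n", "tline = \"Genesis 1:1\\tB.:-R;>CI73JT\\tbᵊrēšˌîṯ\\tqamets in first word\"\n", "\n", "# the same unpacking that runtests() applies to every line of a test file\n", "(passageStr, orig, expected, comment) = tline.rstrip(\"\\n\").split(\"\\t\")\n", "print(passageStr, orig, expected, comment, sep=\" | \")"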
] }, { "cell_type": "code", "execution_count": 48, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s loading features ...\n", " 0.04s All features loaded/computed - for details use loadLog()\n" ] } ], "source": [ "api = TF.load(\n", "    \"\"\"\n", "    qere qere_trailer\n", "    g_word_utf8 g_cons_utf8 trailer\n", "    g_word g_cons lex_utf8 lex lex0\n", "    sp vs vt gn nu ps st\n", "    uvf prs g_prs pfm vbs vbe\n", "    languageISO\n", "\"\"\"\n", ")\n", "api.makeAvailableIn(globals())" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Auxiliary functions\n", "\n", "### Composing tests\n", "\n", "Given an occurrence in ETCBC transliteration in a passage, or a node number, we want to easily compile a test out of it.\n", "Say we are looking for ``orig``.\n", "\n", "The match need not be perfect.\n", "We want to find the node `w`, which carries a transliteration that occurs at the end of ``orig``.\n", "If there are multiple, we want the longest.\n", "If there are multiple longest ones, we want the first that occurs in the passage." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_hebrew(orig):\n", "    origm = Transcription.suffix_and_finales(orig)\n", "    return Transcription.to_hebrew(origm[0] + origm[1]).replace(\"-\", \"\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_passage(w):\n", "    return T.sectionFromNode(w, lang=\"la\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def tupleFromStr(passage):\n", "    (book, rest) = passage.split()\n", "    (chapter, verse) = rest.split(\":\")\n", "    return (book, int(chapter), int(verse))" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def maketest(ws=None, orig=None, passageStr=None, expected=None, comment=None):\n", "    if comment is None:\n", "        comment = \"isolated case\"\n", "    passage = None if passageStr is None else tupleFromStr(passageStr)\n", "    if ws is None:\n", "        if passage is not None and orig is not None:\n", "            ws = find_w(passage, orig)\n", "        if ws is None:\n", "            TF.error(\"Cannot make test: {}: {} not found\".format(passageStr, orig))\n", "            return None\n", "    else:\n", "        if type(ws) is int:\n", "            ws = [ws]\n", "        passage = get_passage(ws[-1])\n", "    if expected is None:\n", "        expected = phono(ws, punct=False)\n", "    test = (ws, expected.rstrip(\" \"), comment)\n", "    return test" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "### Formatting test results\n", "\n", "Here are some HTML/CSS definitions for formatting test results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def h_esc(txt):\n", "    return txt.replace(\"&\", \"&amp;\").replace(\"<\", \"&lt;\").replace(\">\", \"&gt;\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def test_html_head(title, stats, mystats):\n", "    return (\n", "        \"\"\"\n", "\n", " \n", " \"\"\"\n", "        + title\n", "        + \"\"\"\n", " \n", "\n", "\"\"\"\n", "        + ((\"

\" + stats + \"

\") if stats else \"\")\n", " + ((\"

\" + mystats + \"

\") if mystats else \"\")\n", " + \"\"\"\n", "\n", "\"\"\"\n", " )" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "test_html_tail = \"\"\"
\n", "\n", "\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "### Run tests\n", "\n", "This is the function that runs a sequence of tests.\n", "If the second argument is a string, it reads a tab separated file with tests from a file with that name.\n", "Otherwise it should be a list of tests, a test being a list or tuple consisting of:\n", "\n", " source, orig, lex-info, expected, comment\n", "\n", "where ``source`` is either a string ``passage`` or a number ``w``.\n", "If it is a ``w``, it is the node corresponding to the word, and it is used to get the ``passage, orig, lex_info`` which are allowed to be empty.\n", "If it is a ``passage``, the node will be looked up on the basis of it plus ``orig``.\n", "If the node is found, it will be used to get the ``lex_info``, if not, the given ``lex_info`` will be used." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def vfname(inpath):\n", " (indir, infile) = os.path.split(inpath)\n", " (inbase, inext) = os.path.splitext(infile)\n", " return os.path.join(indir, inbase + VERSION + inext)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def runtests(title, testsource, outfilename, htmlfilename, order=True, screen=False):\n", " skipped = 0\n", " if type(testsource) is list:\n", " tests = testsource\n", " else:\n", " tests = []\n", " test_in_file = open(testsource)\n", " for tline in test_in_file:\n", " (passageStr, orig, expected, comment) = tline.rstrip(\"\\n\").split(\"\\t\")\n", " this_test = maketest(\n", " orig=orig, passageStr=passageStr, expected=expected, comment=comment\n", " )\n", " if this_test is not None:\n", " tests.append(this_test)\n", " else:\n", " skipped += 1\n", " test_in_file.close()\n", "\n", " lines = []\n", " htmllines = []\n", " longlines = []\n", " nexact = 0\n", " ngood = 0\n", " ntests = len(tests)\n", " test_sequence = sorted(tests, key=lambda x: (x[1], x[2], x[0])) if order else tests\n", "\n", " for (i, (wset, expected, comment)) in enumerate(test_sequence):\n", " passage = get_passage(wset[-1])\n", " passageStr = \"{} {}:{}\".format(*passage)\n", " wss = partition_w(wset)\n", " orig = \"\".join(get_orig(w, punct=True, set_pet=True, tetra=False) for w in wset)\n", " wordph = \"\"\n", " lex_info = \"\"\n", " dout = []\n", " for (j, ws) in enumerate(wss):\n", " this_lex_info = get_lex_info(ws[-1])\n", " (this_wordph, this_dout) = phono(\n", " ws, punct=not (j == len(wss) - 1), debug=True\n", " )\n", " wordph += this_wordph\n", " lex_info += this_lex_info\n", " dout.extend(this_dout)\n", " wordph = wordph.rstrip(\" \")\n", " if wordph == expected:\n", " isgood = \"=\"\n", " nexact += 1\n", " elif wordph.replace(\"ˌ\", \"\").replace(\"ˈ\", \"\").replace(\n", " \"-\", \"\"\n", " ) == expected.replace(\"ˌ\", \"\").replace(\"ˈ\", \"\").replace(\"-\", \"\"):\n", " isgood = \"~\"\n", " ngood += 1\n", " else:\n", " isgood = \"#\"\n", " line_text = \"{:>3} {:<19} {:>6} {:<17} {:<22} {:<20} {} {:<20}\".format(\n", " i + 1,\n", " passageStr,\n", " ws[-1],\n", " lex_info,\n", " orig,\n", " wordph,\n", " isgood,\n", " \"\" if isgood == \"=\" else expected,\n", " )\n", " lines.append(line_text)\n", " if screen:\n", " if isgood in {\"=\", \"~\"}:\n", " TF.info(line_text, tm=False)\n", " if isgood not in {\"=\", \"~\"}:\n", " TF.info(line_text, tm=False)\n", " longlines.append(\n", " \"{:>3} {:<19} {:>6} {:<17} 
{:<25} => {:<25} < {} {:<25} # {}\\n{}\\n\\n\".format(\n", " i + 1,\n", " passageStr,\n", " ws[-1],\n", " lex_info,\n", " orig,\n", " wordph,\n", " isgood,\n", " \"\" if isgood == \"=\" else expected,\n", " comment,\n", " \"\\n\".join(\"{:<7} {:<20} {}\".format(\"\", x[0], x[1]) for x in dout),\n", " )\n", " )\n", " htmllines.append(\n", " (\n", " \"\"\"\n", " \n", " {i}\n", " {v} {w}\n", " {t}\n", " {h}\n", " {l}\n", " {p}\n", " {e}\n", " {c}\n", " \n", " \"\"\"\n", " ).format(\n", " st=\"exact\" if isgood == \"=\" else \"good\" if isgood == \"~\" else \"error\",\n", " i=i + 1,\n", " v=passageStr,\n", " w=\"\" if w is None else w,\n", " t=h_esc(orig),\n", " l=lex_info,\n", " h=get_hebrew(orig),\n", " p=wordph,\n", " e=\"\" if isgood == \"=\" else expected,\n", " est=\"\" if isgood == \"=\" else \" ca\" if isgood == \"~\" else \" norm\",\n", " c=h_esc(comment),\n", " )\n", " )\n", "\n", " line_text = \"\\n\".join(lines)\n", " longline_text = \"\\n\".join(longlines)\n", " test_out_file = open(vfname(outfilename), \"w\")\n", " test_out_file.write(\"{}\\n\\n{}\\n\".format(line_text, longline_text))\n", " stats = \"{} tests; {} skipped; {} failed; {} passed of which {} exactly.\".format(\n", " ntests + skipped,\n", " skipped,\n", " ntests - ngood - nexact,\n", " ngood + nexact,\n", " nexact,\n", " )\n", " TF.info(\n", " \"ntests={}, skipped={}, ngood={}, nexact={}\".format(\n", " ntests, skipped, ngood, nexact\n", " )\n", " )\n", " test_out_file.close()\n", " test_html_file = open(vfname(htmlfilename), \"w\")\n", " test_html_headline = \"\"\"\n", " \n", " v\n", " verse\n", " etcbc\n", " hebrew\n", " lexical\n", " phono\n", " expected\n", " comment\n", " \n", " \"\"\"\n", " test_html_file.write(\n", " \"{}{}{}{}\".format(\n", " test_html_head(title, stats, \"\"),\n", " test_html_headline,\n", " \"\".join(htmllines),\n", " test_html_tail,\n", " )\n", " )\n", " test_html_file.close()\n", " TF.info(stats, tm=False)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "### Produce showcases\n", "\n", "This is a variant on ``runtests()``.\n", "\n", "It produces overviews of the cases where the corpus dependent rules have been applied." 
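, "\n", "\n", "Before the variant itself, one detail of ``runtests()`` above is worth isolating: a transcription counts as *good* (``~``) rather than *exact* (``=``) when it matches the expectation after the accent marks and the hyphen are ignored. A self-contained sketch of that comparison:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def matches_modulo_accents(got, expected):\n", "    # strip the secondary (ˌ) and primary (ˈ) accent marks and the hyphen,\n", "    # the same normalization runtests() applies before its ~ verdict\n", "    def norm(s):\n", "        return s.replace(\"ˌ\", \"\").replace(\"ˈ\", \"\").replace(\"-\", \"\")\n", "\n", "    return norm(got) == norm(expected)\n", "\n", "\n", "print(matches_modulo_accents(\"bᵊrēšˌîṯ\", \"bᵊrēšîṯ\"))  # True: only accents differ\n", "print(matches_modulo_accents(\"bārˈā\", \"bārˈô\"))  # False: a real vowel difference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the showcase generator itself."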
] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def showcases(title, stats, testsource, order=True):\n", " ctitle = title + \" cases\"\n", " ttitle = title + \" tests\"\n", " fctitle = ctitle.replace(\" \", \"_\")\n", " fttitle = ttitle.replace(\" \", \"_\")\n", " test_file_name = vfname(fttitle + \".txt\")\n", " html_file_name = vfname(fctitle + \".html\")\n", "\n", " TF.info(\"Generating HTML in {}\".format(html_file_name))\n", " TF.info(\"Generating test set {} in {}\".format(title, test_file_name))\n", "\n", " htmllines = []\n", " ncorr = 0\n", " test_sequence = (\n", " sorted(testsource, key=lambda x: (x[3], x[0], x[1], x[5]))\n", " if order\n", " else testsource\n", " )\n", " ntests = len(testsource)\n", "\n", " test_file = open(test_file_name, \"w\")\n", " for (i, (corr, wordph, wordph_c, lex, orig, w, comment)) in enumerate(\n", " test_sequence\n", " ):\n", " passage = get_passage(w)\n", " passageStr = \"{} {}:{}\".format(*passage)\n", " lex_info = get_lex_info(w)\n", " test_file.write(\n", " \"{}\\t{}\\t{}\\t{}\\n\".format(\n", " passageStr,\n", " orig,\n", " wordph_c,\n", " comment,\n", " )\n", " )\n", " heb = get_hebrew(orig)\n", " if corr:\n", " ncorr += 1\n", " htmllines.append(\n", " (\n", " \"\"\"\n", " \n", " {i}\n", " {cr}\n", " {tl}\n", " {v} {w}\n", " {l}\n", " {h}\n", " {p}\n", " {pc}\n", " {t}\n", " {c}\n", " \n", "\"\"\"\n", " ).format(\n", " i=i + 1,\n", " st=\" cr\" if corr else \"\",\n", " st1=\" good\" if corr else \"\",\n", " cr=corr,\n", " tl=h_esc(lex),\n", " v=passageStr,\n", " w=\"\" if w is None else w,\n", " l=lex_info,\n", " h=heb,\n", " p=wordph if wordph != wordph_c else \"\",\n", " pc=wordph_c,\n", " t=h_esc(orig),\n", " c=h_esc(comment),\n", " )\n", " )\n", " test_file.close()\n", "\n", " mystats = \"{} occurrences and {} corrections\".format(\n", " ntests,\n", " ncorr,\n", " )\n", " test_html_headline = \"\"\"\n", " \n", " n\n", " correction\n", " lexeme\n", " verse\n", " lexical\n", " hebrew\n", " phono
uncorrected\n", " phono
corrected\n", " etcbc\n", " comment\n", " \n", " \"\"\"\n", " test_html_file = open(html_file_name, \"w\")\n", " test_html_file.write(\n", " \"{}{}{}{}\".format(\n", " test_html_head(ctitle, stats, mystats),\n", " test_html_headline,\n", " \"\".join(htmllines),\n", " test_html_tail,\n", " )\n", " )\n", " test_html_file.close()\n", " if stats:\n", " TF.info(stats, tm=False)\n", " if mystats:\n", " TF.info(mystats, tm=False)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "## Test the existing examples" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 9.49s ntests=86, skipped=0, ngood=19, nexact=67\n", "86 tests; 0 skipped; 0 failed; 86 passed of which 67 exactly.\n", " 10s ntests=1574, skipped=0, ngood=197, nexact=1377\n", "1574 tests; 0 skipped; 0 failed; 1574 passed of which 1377 exactly.\n", " 10s ntests=513, skipped=0, ngood=30, nexact=483\n", "513 tests; 0 skipped; 0 failed; 513 passed of which 483 exactly.\n", " 11s ntests=209, skipped=0, ngood=30, nexact=179\n", "209 tests; 0 skipped; 0 failed; 209 passed of which 179 exactly.\n" ] } ], "source": [ "for tname in [\n", " \"mixed\",\n", " \"qamets_nonverb_tests\",\n", " \"qamets_verb_tests\",\n", " \"qamets_prs_tests\",\n", "]:\n", " runtests(\n", " tname,\n", " \"{}.txt\".format(tname),\n", " \"{}_debug.txt\".format(tname),\n", " \"{}.html\".format(tname),\n", " screen=False,\n", " )" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "### Testing: Special cases" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "special_tests = [\n", " dict(passageStr=\"Joel 1:17\", orig=\"<@B:C74W.\", comment=\"qamets gadol or qatan\"),\n", " dict(ws=7494, expected=None, comment=\"schwa in front of BGDKPT without dagesh\"),\n", " dict(ws=5, expected=None, comment=\"article in isolation\"),\n", " dict(ws=6, expected=None, comment=\"word after article in isolation\"),\n", " dict(ws=106, expected=None, comment=\"proclitic min\"),\n", " dict(\n", " ws=107, expected=None, comment=\"word starting with BGDKPT after proclitic min\"\n", " ),\n", " dict(\n", " passageStr=\"Genesis 1:7\",\n", " orig=\"MI-T.A74XAT\",\n", " expected=None,\n", " comment=\"proclitic min combined with word starting with BGDKPT\",\n", " ),\n", " dict(ws=1684, expected=None, comment=\"Tetra with end of verse\"),\n", " dict(\n", " passageStr=\"Genesis 4:1\",\n", " orig=\"J:HW@75H00\",\n", " expected=None,\n", " comment=\"Tetra with end of verse\",\n", " ),\n", " dict(ws=27477, expected=None, comment=\"pronominal suffix after verb\"),\n", " dict(ws=155387, expected=None, comment=\"peculiar representation of tetragrammaton\"),\n", " dict(\n", " passageStr=\"Proverbia 10:10\",\n", " orig=\"HLH\", expected=None, comment=\"ketiv qere\"),\n", " dict(\n", " passageStr=\"Genesis 1:27\",\n", " orig=\"H@95->@D@M03\",\n", " expected=None,\n", " comment=\"qamets gadol\",\n", " ),\n", "]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "compiled_tests = []\n", "for t in special_tests:\n", " this_test = maketest(**t)\n", " if this_test is not None:\n", " compiled_tests.append(this_test)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1 Genesis 9:21 4420 subs sm,+h >@H:@LO75W00 *ʔohᵒlˈô = \n", " 2 Genesis 4:1 
1685 nmpr sm, J:HW@75H00 [yᵊhwˈāh] = \n", " 3 Genesis 17:11 7494 subs sm, B.:FA74R bᵊśˈar = \n", " 4 Genesis 1:27 539 subs sm, H@95->@D@M03 hˈāʔāḏˌām = \n", " 5 Genesis 1:1 6 art HA- hˌa = \n", " 6 Genesis 1:7 108 subs sm, MI-T.A74XAT mittˈaḥaṯ = \n", " 7 Genesis 1:7 107 prep MI- mˌi = \n", " 8 Samuel_I 23:3 155387 nmpr s-, Q:ET& ʔeṯ- = \n", " 10 Genesis 48:9 27477 prep >;LA73J ʔēlˌay = \n", " 11 Genesis 1:1 5 prep >;71T ʔˌēṯ = \n", " 12 Genesis 1:7 106 conj >:ACER03 ʔᵃšˌer = \n", " 13 Proverbia 10:10 349420 subs sf,