{ "cells": [ { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true }, "source": [ "\n", "\n", "\n", "\n", "\n", "# Phonetic Transliteration of Hebrew Masoretic Text\n", "\n", "# Frequently asked questions\n", "\n", "Q: *What is the use of a phonetic transliteration of the Hebrew Bible? What can anyone wish beyond the careful, meticulous Masoretic system of consonants, vowels and accents?*\n", "\n", "A: Several things:\n", "\n", "* the Hebrew Bible may be a subject of study in various fields,\n", " where the people involved do not master the Hebrew script;\n", " a phonetic transcription removes a hurdle for them.\n", "* in computational linguistics there are many tools that deal with written language in Latin alphabets;\n", " even a task as simple as getting the consonant-vowel pattern of a word is unnecessarily complicated\n", " when using the Hebrew script.\n", "* in phonetics and language learning theory, it is important to represent the sounds without being burdened\n", " by the idiosyncrasies of the writing system and the spelling.\n", "\n", "Q: *But surely, there already exist transliterations of Hebrew? Why not use them?*\n", "\n", "Here are a few pragmatic reasons:\n", "\n", "* we want to be able to *compute* a transliteration based upon our own data;\n", "* we want to learn to what extent the transliteration can be purely rule-based, and to what extent\n", " it depends on lexical information that you just need to know;\n", "* we want to make available a well-documented transliteration that can be studied, borrowed and improved by others.\n", "\n", "Q: *But how **good** is your transliteration?*\n", "\n", "We do not know yet. 
A few remarks though:\n", "\n", "* we have applied most of the *rules* that we could find in Hebrew grammars;\n", "* we have suspended some of the rules for some verb paradigms where it is known that they lead to incorrect results;\n", "* where the rules did not suffice, we have searched the corpus for other occurrences of the same word, to get clues;\n", "* where we knew that clues pointed in the wrong direction, we have applied a list of exceptions (currently containing only the word בָּתִּֽים: \\*bottˈîm => bāttˈîm);\n", "* we have a fair test set with critical cases that all pass;\n", "* we have a few tables of all cases where the algorithm has made corpus-based decisions and lexical decisions;\n", "* we are open to your corrections: log in to [SHEBANQ](https://shebanq.ancient-data.org), go to a passage with an offending phonetic transliteration, and make a manual note. **Tip:** Give that note the keyword ``phono``, so that we\n", " can collect these notes.\n", "\n", "Q: *To me, this is not entirely satisfying.*\n", "\n", "A: Fair enough. Consider jumping to [Bible Online Learner](http://bibleol.3bmoodle.dk/text/show_text),\n", "where they have built in a pretty good transliteration, based on a different method of rule application. 
It is documented in an article by Nicolai Winther-Nielsen:\n", "[Transliteration of Biblical Hebrew for the Role-Lexical Module](http://www.see-j.net/index.php/hiphil/article/view/62)\n", "and additional information can be found in Claus Tøndering's\n", "[Bible Online Learner, Software on GitHub](https://github.com/EzerIT/BibleOL).\n", "See also [Lex: A software project for linguists](http://www.see-j.net/index.php/hiphil/article/view/60/56).\n", "\n", "We are planning to conduct an automatic comparison of both transliteration schemes over the whole corpus.\n", "\n", "Q: *Who is the **we**?*\n", "\n", "That is the author of this notebook, [Dirk Roorda](mailto:dirk.roorda@dans.knaw.nl), working together with Martijn Naaijer and getting input from Nicolai Winther-Nielsen and Willem van Peursen." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview of the results\n", "\n", "1. The main result is a python function ``phono(``*ETCBC-original*``, ...): ``*phonetic transliteration*.\n", "1. Showcases and tests: how the function solves particular classes of problems.\n", " The *cases* file shows a set of cases that have been generated in the last run.\n", "\n", "The *tests* files show a prepared set of cases, against which to test new versions of the algorithm. These results have been obtained on version `c` of the\n", "[BHSA dataset](https://etcbc.github.io/bhsa).\n", " 1. [mixed](mixedc.html)\n", " with log file\n", " [mixed_debug](mixed_debugc.txt).\n", " 1. [qamets-non-verb cases](qamets_nonverb_casesc.html)\n", " and\n", " [qamets-non-verb tests](qamets_nonverb_testsc.html)\n", " with log file\n", " [qamets-nonverb_tests_debug](qamets_nonverb_tests_debugc.txt).\n", " The result of searching the corpus for related occurrences and\n", " having them vote for qatan/gadol interpretation of the qamets.\n", " 1. 
[qamets-verb cases](qamets_verb_casesc.html)\n", " and\n", " [qamets-verb tests](qamets_verb_testsc.html)\n", " with log file\n", " [qamets-verb tests-debug](qamets_verb_tests_debugc.txt).\n", " The result of suppressing the qatan interpretation of the qamets regardless of accent\n", " for a definite set of *verb forms*.\n", " 1. [qamets-prs cases](qamets_prs_casesc.html)\n", " and\n", " [qamets-prs tests](qamets_prs_testsc.html)\n", " with log file\n", " [qamets-prs tests-debug](qamets_prs_tests_debugc.txt).\n", " The result of suppressing the qatan interpretation of the qamets in *pronominal suffixes*.\n", "1. A [plain text](combi.txt) with the complete text in BHSA transliteration and phonetic transcription,\n", " verse by verse." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview of the method\n", "\n", "## High-level description\n", "\n", "1. **BHSA transliteration**\n", " Our starting point is the BHSA full transliteration of the Hebrew Masoretic text.\n", " This transliteration is in 1-1 correspondence with the Masoretic text, including all vowels and accents.\n", "1. **Grammar rules**\n", " We have implemented the rules we find in grammars of Hebrew about long and short qamets, mobile and silent schwa,\n", " dagesh, and mater lectionis.\n", " The implementation takes the form of a row of *regular expressions*,\n", " where we transliterate targeted pieces of the original.\n", " These regular expressions are exquisitely formulated, and must be applied in the given order.\n", " *Beware:* Seemingly innocent modifications in these expressions or in the order of application,\n", " may ruin the transcription completely.\n", "1. **Qamets puzzles: verbs**\n", " In many verb forms the grammar rules would dictate that a certain qamets is qatan while in fact it is gadol.\n", " In most cases this is caused by the fact that no accent has been marked on the syllable that carries the\n", " qamets in question. 
There is a limited set of verb paradigms where this occurs.\n", " We detect those and suppress the qamets qatan interpretation for them.\n", "1. **Qamets puzzles: non-verbs**\n", " There are quite a few non-verb occurrences where the accent pattern of a word would make a qamets\n", " qatan according to the grammar rules.\n", " Yet, other occurrences of the same lexeme have other accent patterns, and\n", " lead to a gadol interpretation of the same qamets.\n", " In this case we count the unique cases in favor of gadol versus qatan, and let the majority decide for all\n", " occurrences. In cases where we know that the majority votes wrong, we have intervened.\n", "\n", "### Qamets working hypothesis\n", "Note that in the *non-verb qamets puzzles* we have tacitly assumed that qamets qatan and gadol are not phonological variants of each other.\n", "In other words, it never occurs that a qamets gadol becomes shortened into a qamets qatan.\n", "From the grammar rules it follows that short versions of the qamets can only be\n", "\n", "* patah\n", "* schwa\n", "* composite schwa with patah\n", "\n", "and never\n", "\n", "* qamets qatan\n", "* composite schwa with qamets\n", "\n", "Whether this hypothesis is right is outside my competence.\n", "We just use it as a working hypothesis.\n", "\n", "## Lexical information\n", "\n", "This method is not pure, in the sense that it does not work only with the information given in the source string.\n", "We *cheat*, i.e. we use morphological information from the BHSA database to\n", "steer us in the right direction. 
To this end, the input of `phono()` is always a\n", "Text-Fabric node, from which we can get all information we need.\n", "\n", "More precisely, the input is a sequence of nodes.\n", "This sequence is meant to correspond to a sequence of slots belonging to words that are written adjacently\n", "(no space between, no maqef between).\n", "From these nodes we can look up:\n", "\n", "* the BHSA transliteration\n", "* the qere (if there is a discrepancy between ketiv and qere)\n", "* additional lexical information (taken from the last node)\n", "\n", "## Combined words\n", "\n", "You can use `phono()` to transliterate multiple words at the same time, but you can also do individual words,\n", "even if in Hebrew they are written together.\n", "However, it is better to feed combined words to `phono()` in one go, because the prefix word may influence the transliteration of the postfix word. Think of the article followed by a word starting with a `BGDKPT` letter.\n", "The dagesh in the `BGDKPT` letter is interpreted as a lene if the word stands on its own, but as a forte if it is combined.\n", "\n", "However, it is not advised to feed longer strings to `phono()`, because when `phono()` retrieves lexical information, it uses the information of the last node that matches a word in the input string.\n", "\n", "## Accents\n", "\n", "We determine \"primary\" and \"secondary\" stress in our transliteration, but this must not be taken in a phonetic sense.\n", "Every syllable that carries an accent pointing will get a primary stress mark.\n", "However, a few specific accent pointings are not deemed to produce an accent, and another group of accents\n", "is deemed to produce only a secondary accent.\n", "The last syllable of a word also gets a secondary accent by default.\n", "We have not yet tried to be more precise in this, so *segolates* do not get the treatment they deserve.\n", "\n", "The main rationale for accents is that they prevent a qamets from being read as qatan.\n", "\n", "## 
Individual symbols\n", "\n", "We have made a careful selection of Unicode symbols to represent Hebrew sounds.\n", "Sometimes we follow the phonetic usage of the symbols, sometimes we follow widespread custom.\n", "The actual mapping can be plugged in quite easily,\n", "and the intermediate stages in the transformation do not use these symbols,\n", "so the algorithm can be easily adapted to other choices.\n", "\n", "### Consonants\n", "\n", "Provided it is not part of a long vowel, we write `י` as `y`,\n", "whilst `j` would be more in line with the phonetic alphabet.\n", "\n", "Likewise, we write `ו` as `w`, if it is not part of a long vowel.\n", "If a word ends in `יו` the `ו` is not a mater lectionis, and the `י` gets elided.\n", "We represent this phonetically as `ʸw`.\n", "\n", "With regard to the `BGDKPT` letters,\n", "it would have been attractive to use the letters `b g d k p t` without\n", "diacritic for the plosive variants, and with a suitable diacritic for the fricative variants.\n", "Alas, the Unicode table does not offer a suitable diacritic that is available for all six of these letters.\n", "\n", "So, we use `b g d k p t` for the plosives, but for the fricatives we use `v ḡ ḏ ḵ f ṯ`.\n", "\n", "We represent the *emphatic* consonants ט and ח and צ\n", "with a dot below: `ṭ ḥ ṣ`.\n", "ק is just `q`.\n", "\n", "\n", "ע and א translate to `ʕ` and `ʔ`.\n", "\n", "שׁ and שׂ translate to `š` and `ś`.\n", "ס is just `s`.\n", "\n", "When א and ה are mater lectionis, they are left out. A ה with mappiq becomes just `h`,\n", "like every ה which is not a mater lectionis.\n", "\n", "We do not mark the deviant final forms of the consonants ך and ם and ן and ף and ץ, assuming that\n", "this is just a scriptural peculiarity, with no effect on the actual sounds.\n", "\n", "The remaining consonants go as follows:\n", "\n", "
ל | l |
מ | m |
נ | n |
ר | r |
ז | z |

* | *? | *+ |
+ | +? | ++ |
{*n*,*m*} | {*n*,*m*}? | {*n*,*m*}+ |
\" + stats + \"
\") if stats else \"\")\n", " + ((\"\" + mystats + \"
\") if mystats else \"\")\n", " + \"\"\"\n", "