{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Accent patterns\n",
"\n",
"Request by Robert Voogdgeert.\n",
"\n",
"Make a CSV of half verses in a representation that only shows accents and word boundaries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import re\n",
"\n",
"from tf.app import use\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"TF-app: ~/github/annotation/app-bhsa/code"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"data: ~/text-fabric-data/etcbc/bhsa/tf/c"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"data: ~/text-fabric-data/etcbc/phono/tf/c"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"data: ~/text-fabric-data/etcbc/parallels/tf/c"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"A = use(\"ETCBC/bhsa:clone\", hoist=globals(), silent=\"deep\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chunks\n",
"\n",
"You can configure a chunk to be `half_verse` or `clause`.\n",
"\n",
"If the chunk is `half_verse`, we use the feature `label` to identify it within the verse.\n",
"\n",
"If the chunk is `clause`, we use the sentence number and the clause number to identify it.\n",
"\n",
"In `chunkTypes` we store a mapping of all chunk types we support to functions that provide a label for such chunks."
]
},
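{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, the dispatch idea can be sketched independently of Text-Fabric: a dict maps each type name to a labelling function, and unknown types fall back to `None` via `dict.get`. All names in this sketch are invented for illustration:\n",
"\n",
"```python\n",
"# toy dispatch table: chunk-type name -> function producing a label\n",
"labelFuncs = {\n",
"    \"half_verse\": lambda chunk: chunk[\"label\"],\n",
"    \"clause\": lambda chunk: f'{chunk[\"sentence\"]}.{chunk[\"number\"]}',\n",
"}\n",
"\n",
"\n",
"def labelOf(chunk):\n",
"    # unknown chunk types get a \"?\" label, as in showChunks below\n",
"    func = labelFuncs.get(chunk[\"otype\"])\n",
"    return \"?\" if func is None else func(chunk)\n",
"\n",
"\n",
"labelOf({\"otype\": \"clause\", \"sentence\": 8, \"number\": 2})  # '8.2'\n",
"```"
]
},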
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"chunkTypes = dict(\n",
" half_verse=F.label.v,\n",
" clause=lambda n: f'{F.number.v(L.u(n, otype=\"sentence\")[0])}.{F.number.v(n)}',\n",
" clause_atom=F.number.v,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 2
},
"source": [
"Here is a function that shows chunks."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def showChunks(chunks):\n",
" for c in chunks:\n",
" cType = F.otype.v(c)\n",
" headFunc = chunkTypes.get(cType, None)\n",
" head = \"?\" if headFunc is None else headFunc(c)\n",
" passage = T.sectionFromNode(c)\n",
" heading = \"{} {}:{} {}\".format(*passage, head)\n",
" text = T.text(c, fmt=\"text-trans-full\")\n",
" print(f\"{heading}\\n\\t{text}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's inspect a few half verses (the first and second ones and one which contains\n",
"a word with an in-word space):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Genesis 1:1 A\n",
"\tB.:-R;>CI73JT B.@R@74> >:ELOHI92JM \n",
"Genesis 1:1 B\n",
"\t>;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n",
"1_Chronicles 2:54 A\n",
"\tB.:N;74J FAL:M@81> B.;71JT_LE33XEM03 W.-N:VO74WP@TI80J @92B \n"
]
}
],
"source": [
"chunkType = \"half_verse\"\n",
"\n",
"(h1, h2) = F.otype.s(chunkType)[0:2]\n",
"v = T.nodeFromSection((\"1_Chronicles\", 2, 54))\n",
"h3 = L.d(v, otype=chunkType)[0]\n",
"\n",
"showChunks((h1, h2, h3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's inspect a few clauses (the first ten)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Genesis 1:1 1.1\n",
"\tB.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n",
"Genesis 1:2 2.1\n",
"\tW:-H@->@81REY H@J:T@71H TO33HW.03 W@-BO80HW. \n",
"Genesis 1:2 3.1\n",
"\tW:-XO73CEK: :ELOHI80JM M:RAXE73PET MER >:ELOHI73JM \n",
"Genesis 1:3 6.1\n",
"\tJ:HI74J >O92WR \n",
"Genesis 1:3 7.1\n",
"\tWA45-J:HIJ&>O75WR00 \n",
"Genesis 1:4 8.1\n",
"\tWA-J.A94R:> >:ELOHI91JM >ET&H@->O73WR \n",
"Genesis 1:4 8.2\n",
"\tK.IJ&VO92WB \n",
"Genesis 1:4 9.1\n",
"\tWA-J.AB:D.;74L >:ELOHI80JM B.;71JN H@->O73WR W.-B;71JN HA-XO75CEK:00 \n"
]
}
],
"source": [
"chunkType = \"clause\"\n",
"\n",
"chunks = F.otype.s(chunkType)[0:10]\n",
"\n",
"showChunks(chunks)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pattern from a chunk\n",
"\n",
"We define a function to get the accent pattern from a chunk."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function works by stripping all non-digit-non-space material, then splitting on space, then\n",
"dividing the numbers into pairs, and then joining everything together.\n",
"\n",
"We exclude some marks, because they are not proper cantillation accents."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"excludedAccents = {\n",
" \"35\",\n",
" \"45\",\n",
" \"75\",\n",
" \"95\", # meteg\n",
" \"52\",\n",
" \"53\", # upper and lower dots\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"stripPat = re.compile(r\"[^0-9 ]\")\n",
"accentPat = re.compile(r\"[0-9]{2}\")\n",
"\n",
"\n",
"def getAccents(chunk):\n",
" trans = T.text(chunk, fmt=\"text-trans-full\").replace(\"_\", \" \")\n",
" words = stripPat.sub(\"\", trans).split()\n",
" items = []\n",
" for word in words:\n",
" accents = [ac for ac in accentPat.findall(word) if ac not in excludedAccents]\n",
" items.append(\"_\".join(accents))\n",
" return \" \".join(items)"
]
},
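{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the pattern extraction in isolation, here is a self-contained sketch (plain Python, no Text-Fabric needed) that applies the same regexes to literal transliteration strings taken from the output above. The name `accentPattern` is invented; it mirrors `getAccents` minus the `T.text()` call:\n",
"\n",
"```python\n",
"import re\n",
"\n",
"stripPat = re.compile(r\"[^0-9 ]\")  # keep only digits and spaces\n",
"accentPat = re.compile(r\"[0-9]{2}\")  # accents are two-digit codes\n",
"excludedAccents = {\"35\", \"45\", \"75\", \"95\", \"52\", \"53\"}\n",
"\n",
"\n",
"def accentPattern(trans):\n",
"    # same steps as getAccents, starting from a transliterated string\n",
"    words = stripPat.sub(\"\", trans.replace(\"_\", \" \")).split()\n",
"    return \" \".join(\n",
"        \"_\".join(ac for ac in accentPat.findall(w) if ac not in excludedAccents)\n",
"        for w in words\n",
"    )\n",
"\n",
"\n",
"accentPattern(\"B.:-R;>CI73JT B.@R@74> >:ELOHI92JM\")  # '73 74 92'\n",
"```"
]
},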
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"73 74 92\n",
"71 73 71 00\n",
"74 81 71 33_03 74_80 73 74 92\n",
"73 74 92 71 73 71 00\n",
"81 71 33_03 80\n",
"73 74 92\n",
"74 80 73 71 00\n",
"71 73\n",
"74 92\n",
"00\n",
"94 91 73\n",
"92\n",
"74 80 71 73 71 00\n"
]
}
],
"source": [
"for c in (h1, h2, h3, *chunks):\n",
" print(getAccents(c))"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 2
},
"source": [
"# Process the selection\n",
"\n",
"We define a function to process a given selection with a given chunk type.\n",
"\n",
"The file is saved to the `destination`, by default your Downloads folder."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def process(selection, chunkType, destination=\"~/Downloads\"):\n",
" A.indent(reset=True)\n",
" A.info(f\"Gather all {chunkType}s ...\")\n",
" rows = []\n",
"\n",
" headFunc = chunkTypes.get(chunkType, None)\n",
" if not headFunc:\n",
" A.error(f\"Chunk type {chunkType} not supported\")\n",
" return\n",
"\n",
" for v in F.otype.s(\"verse\"):\n",
" (book, chapter, verse) = T.sectionFromNode(v)\n",
" if selection is not None and book not in selection:\n",
" continue\n",
" for chunk in L.d(v, otype=chunkType):\n",
" head = headFunc(chunk)\n",
" accents = getAccents(chunk)\n",
" rows.append((book, chapter, verse, head, accents))\n",
" A.info(f\"{len(rows)} {chunkType}s done\")\n",
"\n",
" csvRaw = f\"{destination}/accents-{chunkType}.csv\"\n",
" csv = os.path.expanduser(csvRaw)\n",
"\n",
" with open(csv, \"w\") as fh:\n",
" for row in rows:\n",
" fh.write(\",\".join(str(f) for f in row) + \"\\n\")\n",
"\n",
" A.info(f\"Results written to {csvRaw}\")\n",
" return rows"
]
},
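{
"cell_type": "markdown",
"metadata": {},
"source": [
"The hand-rolled `\",\".join(...)` above is fine here, because book names, labels, and accent patterns never contain commas or quotes. If the fields could ever contain such characters, the standard `csv` module would take care of quoting. A minimal sketch, with an invented helper `writeRows`:\n",
"\n",
"```python\n",
"import csv\n",
"import io\n",
"\n",
"\n",
"def writeRows(rows, handle):\n",
"    # csv.writer quotes and escapes fields when needed\n",
"    writer = csv.writer(handle)\n",
"    writer.writerows(rows)\n",
"\n",
"\n",
"buffer = io.StringIO()\n",
"writeRows([(\"Genesis\", 1, 1, \"A\", \"73 74 92\")], buffer)\n",
"buffer.getvalue()  # 'Genesis,1,1,A,73 74 92\\r\\n'\n",
"```"
]
},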
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Selection\n",
"\n",
"You may choose to do all books or selected books only."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# tweak this cell by specifying the set of books you want done (English book names)\n",
"# books = None means: all books\n",
"\n",
"books = None\n",
"# books = {'Numbers', 'Ruth'}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Half verses"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0.00s Gather all half_verses ...\n",
" 2.84s 45180 half_verses done\n",
" 2.93s Results written to ~/Downloads/accents-half_verse.csv\n"
]
}
],
"source": [
"rows = process(books, \"half_verse\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Genesis', 1, 1, 'A', '73 74 92'),\n",
" ('Genesis', 1, 1, 'B', '71 73 71 00'),\n",
" ('Genesis', 1, 2, 'A', '81 71 33_03 80 73 74 92'),\n",
" ('Genesis', 1, 2, 'B', '74 80 73 71 00'),\n",
" ('Genesis', 1, 3, 'A', '71 73 74 92'),\n",
" ('Genesis', 1, 3, 'B', '00'),\n",
" ('Genesis', 1, 4, 'A', '94 91 73 92'),\n",
" ('Genesis', 1, 4, 'B', '74 80 71 73 71 00'),\n",
" ('Genesis', 1, 5, 'A', '63 70_05 03 80 73 74 92'),\n",
" ('Genesis', 1, 5, 'B', '71 73 71 00')]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rows[0:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Clauses"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0.00s Gather all clauses ...\n",
" 3.53s 88071 clauses done\n",
" 3.68s Results written to ~/Downloads/accents-clause.csv\n"
]
}
],
"source": [
"rows = process(books, \"clause\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Genesis', 1, 1, '1.1', '73 74 92 71 73 71 00'),\n",
" ('Genesis', 1, 2, '2.1', '81 71 33_03 80'),\n",
" ('Genesis', 1, 2, '3.1', '73 74 92'),\n",
" ('Genesis', 1, 2, '4.1', '74 80 73 71 00'),\n",
" ('Genesis', 1, 3, '5.1', '71 73'),\n",
" ('Genesis', 1, 3, '6.1', '74 92'),\n",
" ('Genesis', 1, 3, '7.1', '00'),\n",
" ('Genesis', 1, 4, '8.1', '94 91 73'),\n",
" ('Genesis', 1, 4, '8.2', '92'),\n",
" ('Genesis', 1, 4, '9.1', '74 80 71 73 71 00')]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rows[0:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Clause atoms"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0.00s Gather all clause_atoms ...\n",
" 2.79s 90688 clause_atoms done\n",
" 2.94s Results written to ~/Downloads/accents-clause_atom.csv\n"
]
}
],
"source": [
"rows = process(books, \"clause_atom\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Genesis', 1, 1, 1, '73 74 92 71 73 71 00'),\n",
" ('Genesis', 1, 2, 2, '81 71 33_03 80'),\n",
" ('Genesis', 1, 2, 3, '73 74 92'),\n",
" ('Genesis', 1, 2, 4, '74 80 73 71 00'),\n",
" ('Genesis', 1, 3, 5, '71 73'),\n",
" ('Genesis', 1, 3, 6, '74 92'),\n",
" ('Genesis', 1, 3, 7, '00'),\n",
" ('Genesis', 1, 4, 8, '94 91 73'),\n",
" ('Genesis', 1, 4, 9, '92'),\n",
" ('Genesis', 1, 4, 10, '74 80 71 73 71 00')]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rows[0:10]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}