{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Accent patterns\n", "\n", "Request by Robert Voogdgeert.\n", "\n", "Make a CSV of half verses in a representation that only shows accents and word boundaries." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import re\n", "\n", "from tf.app import use\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "TF-app: ~/github/annotation/app-bhsa/code" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/etcbc/bhsa/tf/c" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/etcbc/phono/tf/c" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/text-fabric-data/etcbc/parallels/tf/c" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"ETCBC/bhsa:clone\", hoist=globals(), silent=\"deep\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Chunks\n", "\n", "You can configure a chunk to be `half_verse` or `clause`.\n", "\n", "If the chunk is `half_verse`, we use the feature `label` to identify it within the verse.\n", "\n", "If the chunk is `clause`, we use the sentence number and the clause number to identify it.\n", "\n", "In `chunkTypes` we store a mapping of all chunk types we support to functions that provide a label for such chunks." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "chunkTypes = dict(\n", " half_verse=F.label.v,\n", " clause=lambda n: f'{F.number.v(L.u(n, otype=\"sentence\")[0])}.{F.number.v(n)}',\n", " clause_atom=F.number.v,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "Here is a function that shows chunks." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def showChunks(chunks):\n", " for c in chunks:\n", " cType = F.otype.v(c)\n", " headFunc = chunkTypes.get(cType, None)\n", " head = \"?\" if headFunc is None else headFunc(c)\n", " passage = T.sectionFromNode(c)\n", " heading = \"{} {}:{} {}\".format(*passage, head)\n", " text = T.text(c, fmt=\"text-trans-full\")\n", " print(f\"{heading}\\n\\t{text}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's inspect a few half verses (the first and second ones and one which contains\n", "a word with an in-word space):" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Genesis 1:1 A\n", "\tB.:-R;>CI73JT B.@R@74> >:ELOHI92JM \n", "Genesis 1:1 B\n", "\t>;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n", "1_Chronicles 2:54 A\n", "\tB.:N;74J FAL:M@81> B.;71JT_LE33XEM03 W.-N:VO74WP@TI80J @92B \n" ] } ], "source": [ "chunkType = \"half_verse\"\n", "\n", "(h1, h2) = F.otype.s(chunkType)[0:2]\n", "v = T.nodeFromSection((\"1_Chronicles\", 2, 54))\n", "h3 = L.d(v, otype=chunkType)[0]\n", "\n", "showChunks((h1, h2, h3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's inspect a few clauses (the first ten)." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Genesis 1:1 1.1\n", "\tB.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 \n", "Genesis 1:2 2.1\n", "\tW:-H@->@81REY H@J:T@71H TO33HW.03 W@-BO80HW. \n", "Genesis 1:2 3.1\n", "\tW:-XO73CEK: :ELOHI80JM M:RAXE73PET MER >:ELOHI73JM \n", "Genesis 1:3 6.1\n", "\tJ:HI74J >O92WR \n", "Genesis 1:3 7.1\n", "\tWA45-J:HIJ&>O75WR00 \n", "Genesis 1:4 8.1\n", "\tWA-J.A94R:> >:ELOHI91JM >ET&H@->O73WR \n", "Genesis 1:4 8.2\n", "\tK.IJ&VO92WB \n", "Genesis 1:4 9.1\n", "\tWA-J.AB:D.;74L >:ELOHI80JM B.;71JN H@->O73WR W.-B;71JN HA-XO75CEK:00 \n" ] } ], "source": [ "chunkType = \"clause\"\n", "\n", "chunks = F.otype.s(chunkType)[0:10]\n", "\n", "showChunks(chunks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pattern from a chunk\n", "\n", "We define a function to get the accent pattern from a chunk." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function works by stripping all non-digit-non-space material, then splitting on space, then\n", "dividing the numbers into pairs, and then joining everything together.\n", "\n", "We exclude some marks, because they are not proper cantillation accents." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "excludedAccents = {\n", " \"35\",\n", " \"45\",\n", " \"75\",\n", " \"95\", # meteg\n", " \"52\",\n", " \"53\", # upper and lower dots\n", "}" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "stripPat = re.compile(r\"[^0-9 ]\")\n", "accentPat = re.compile(r\"[0-9]{2}\")\n", "\n", "\n", "def getAccents(chunk):\n", " trans = T.text(chunk, fmt=\"text-trans-full\").replace(\"_\", \" \")\n", " words = stripPat.sub(\"\", trans).split()\n", " items = []\n", " for word in words:\n", " accents = [ac for ac in accentPat.findall(word) if ac not in excludedAccents]\n", " items.append(\"_\".join(accents))\n", " return \" \".join(items)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "73 74 92\n", "71 73 71 00\n", "74 81 71 33_03 74_80 73 74 92\n", "73 74 92 71 73 71 00\n", "81 71 33_03 80\n", "73 74 92\n", "74 80 73 71 00\n", "71 73\n", "74 92\n", "00\n", "94 91 73\n", "92\n", "74 80 71 73 71 00\n" ] } ], "source": [ "for c in (h1, h2, h3, *chunks):\n", " print(getAccents(c))" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "# Process the selection\n", "\n", "We define a function to process a given selection with a given chunk type.\n", "\n", "The file is saved to the `destination`, by default your Downloads folder." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def process(selection, chunkType, destination=\"~/Downloads\"):\n", " A.indent(reset=True)\n", " A.info(f\"Gather all {chunkType}s ...\")\n", " rows = []\n", "\n", " headFunc = chunkTypes.get(chunkType, None)\n", " if not headFunc:\n", " A.error(f\"Chunk type {chunkType} not supported\")\n", " return\n", "\n", " for v in F.otype.s(\"verse\"):\n", " (book, chapter, verse) = T.sectionFromNode(v)\n", " if selection is not None and book not in selection:\n", " continue\n", " for chunk in L.d(v, otype=chunkType):\n", " head = headFunc(chunk)\n", " accents = getAccents(chunk)\n", " rows.append((book, chapter, verse, head, accents))\n", " A.info(f\"{len(rows)} {chunkType}s done\")\n", "\n", " csvRaw = f\"{destination}/accents-{chunkType}.csv\"\n", " csv = os.path.expanduser(csvRaw)\n", "\n", " with open(csv, \"w\") as fh:\n", " for row in rows:\n", " fh.write(\",\".join(str(f) for f in row) + \"\\n\")\n", "\n", " A.info(f\"Results written to {csvRaw}\")\n", " return rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Selection\n", "\n", "You may choose to do all books or selected books only." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# tweak this cell by specifying the set of books you want done (English book names)\n", "# books = None means: all books\n", "\n", "books = None\n", "# books = {'Numbers', 'Ruth'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Half verses" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Gather all half_verses ...\n", " 2.84s 45180 half_verses done\n", " 2.93s Results written to ~/Downloads/accents-half_verse.csv\n" ] } ], "source": [ "rows = process(books, \"half_verse\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Genesis', 1, 1, 'A', '73 74 92'),\n", " ('Genesis', 1, 1, 'B', '71 73 71 00'),\n", " ('Genesis', 1, 2, 'A', '81 71 33_03 80 73 74 92'),\n", " ('Genesis', 1, 2, 'B', '74 80 73 71 00'),\n", " ('Genesis', 1, 3, 'A', '71 73 74 92'),\n", " ('Genesis', 1, 3, 'B', '00'),\n", " ('Genesis', 1, 4, 'A', '94 91 73 92'),\n", " ('Genesis', 1, 4, 'B', '74 80 71 73 71 00'),\n", " ('Genesis', 1, 5, 'A', '63 70_05 03 80 73 74 92'),\n", " ('Genesis', 1, 5, 'B', '71 73 71 00')]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rows[0:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Clauses" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Gather all clauses ...\n", " 3.53s 88071 clauses done\n", " 3.68s Results written to ~/Downloads/accents-clause.csv\n" ] } ], "source": [ "rows = process(books, \"clause\")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Genesis', 1, 1, '1.1', '73 74 92 71 73 71 00'),\n", " ('Genesis', 1, 2, '2.1', '81 71 33_03 80'),\n", " ('Genesis', 1, 2, '3.1', 
'73 74 92'),\n", " ('Genesis', 1, 2, '4.1', '74 80 73 71 00'),\n", " ('Genesis', 1, 3, '5.1', '71 73'),\n", " ('Genesis', 1, 3, '6.1', '74 92'),\n", " ('Genesis', 1, 3, '7.1', '00'),\n", " ('Genesis', 1, 4, '8.1', '94 91 73'),\n", " ('Genesis', 1, 4, '8.2', '92'),\n", " ('Genesis', 1, 4, '9.1', '74 80 71 73 71 00')]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rows[0:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clause atoms" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s Gather all clause_atoms ...\n", " 2.79s 90688 clause_atoms done\n", " 2.94s Results written to ~/Downloads/accents-clause_atom.csv\n" ] } ], "source": [ "rows = process(books, \"clause_atom\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Genesis', 1, 1, 1, '73 74 92 71 73 71 00'),\n", " ('Genesis', 1, 2, 2, '81 71 33_03 80'),\n", " ('Genesis', 1, 2, 3, '73 74 92'),\n", " ('Genesis', 1, 2, 4, '74 80 73 71 00'),\n", " ('Genesis', 1, 3, 5, '71 73'),\n", " ('Genesis', 1, 3, 6, '74 92'),\n", " ('Genesis', 1, 3, 7, '00'),\n", " ('Genesis', 1, 4, 8, '94 91 73'),\n", " ('Genesis', 1, 4, 9, '92'),\n", " ('Genesis', 1, 4, 10, '74 80 71 73 71 00')]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rows[0:10]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }