\n",
" <\"\">\n",
" <\"\">\n",
" <\"\" \"\" morph=\"\">>\n",
"
\n",
" <\"\" morph=\"\">>\n",
" <\"\" \"\" morph=\"\">>\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Morph\n",
"\n",
"but all you want is the treasure: *morph*\n",
"\n",
"```\n",
"HTi\n",
"HVqp3fs\n",
"HNcmsa\n",
"```\n",
"\n",
"from"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"```\n",
"
\n",
"
\n",
" \n",
" אֵיכָ֣ה\n",
" ׀\n",
" יָשְׁבָ֣ה\n",
" בָדָ֗ד\n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## In a nutshell\n",
"\n",
"1. you get more than you want\n",
"1. what you want is intricately wrapped up\n",
"1. **we suffer from leaking concerns**\n",
"1. **we are being micro-managed at several levels**"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"but we do need better logistics in treasure sharing"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Text-Fabric: weave your own web\n",
"\n",
"\n",
"\n",
"AD 1425 [Hausbücher der Nürnberger Zwölfbrüderstiftungen](http://www.nuernberger-hausbuecher.de/75-Amb-2-317-4-v/data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Text Fabric\n",
"\n",
"![TF](images/tf-small.png)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"## Warp and weft\n",
"\n",
"Every TF resource must have two special features: **warp** features.\n",
"\n",
"All other features are **weft** features, they are woven into the warp.\n",
"\n",
"\n",
"[wikipedia](https://en.wikipedia.org/wiki/Weaving)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Warp features\n",
"\n",
"* `otype`: each node has a type\n",
"* `oslots`: each non-slot node is linked to a set of slot nodes\n",
"* `otext`: specification of sections and text formats"
]
},
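{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The warp can be pictured with a toy model. The sketch below is plain Python and only illustrative: the node numbers, the dictionary layout and the helpers `slots_of` and `embeds` are assumptions made for this example, not the Text-Fabric file format or API.\n",
"\n",
"```python\n",
"# Toy warp: nodes 1-4 are slots (words), nodes 5-7 are non-slot nodes.\n",
"otype = {1: 'word', 2: 'word', 3: 'word', 4: 'word',\n",
"         5: 'phrase', 6: 'phrase', 7: 'sentence'}\n",
"\n",
"# oslots links each non-slot node to the set of slots it occupies.\n",
"oslots = {5: {1, 2}, 6: {3, 4}, 7: {1, 2, 3, 4}}\n",
"\n",
"\n",
"def slots_of(node):\n",
"    # A slot node occupies just itself.\n",
"    return oslots.get(node, {node})\n",
"\n",
"\n",
"def embeds(outer, inner):\n",
"    # True if every slot of inner is also a slot of outer.\n",
"    return slots_of(inner) <= slots_of(outer)\n",
"\n",
"\n",
"print(embeds(7, 5))  # True: the sentence contains the first phrase\n",
"print(embeds(5, 6))  # False: the second phrase is not inside the first\n",
"```\n",
"\n",
"Because every non-slot node is anchored to slots, relations such as embedding reduce to set comparisons."
]
},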
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Weft features\n",
"\n",
"These contain the concrete, tangible information:\n",
"\n",
"* the text\n",
"* the linguistic annotations\n",
"* additional data that is linked to the text"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Resources and modules\n",
"\n",
"A TF **resource** is a bunch of TF files.\n",
"\n",
"A TF file contains the data for a single *feature*.\n",
"\n",
"* one fixed set of **warp features**: `otype oslots otext`\n",
"* arbitrary many **weft features**: `sp g_word_utf8`, ...\n",
"* can be augmented with wefts from **TF modules**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A TF **module**\n",
"* has only **weft features**\n",
"* uses the **warp** of a *main* resource."
]
},
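{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Why a module needs no warp of its own can be shown with a toy example. This is a sketch under assumptions: features are modelled as plain dictionaries from node number to value, the values are made up, and `my_note` and `combine` are hypothetical names, not part of Text-Fabric.\n",
"\n",
"```python\n",
"# Toy weft features of a main resource: node number -> value.\n",
"# Only the feature name sp comes from the slides; the values are invented.\n",
"main_wefts = {'sp': {1: 'verb', 2: 'subs', 3: 'prep'}}\n",
"\n",
"# A module brings extra wefts keyed by the same node numbers,\n",
"# so it can rely on the warp of the main resource.\n",
"module_wefts = {'my_note': {2: 'check this word'}}\n",
"\n",
"\n",
"def combine(*weft_sets):\n",
"    # Merge feature dictionaries without modifying the inputs.\n",
"    combined = {}\n",
"    for wefts in weft_sets:\n",
"        for name, mapping in wefts.items():\n",
"            combined.setdefault(name, {}).update(mapping)\n",
"    return combined\n",
"\n",
"\n",
"features = combine(main_wefts, module_wefts)\n",
"print(features['sp'].get(2), '|', features['my_note'].get(2))\n",
"```\n",
"\n",
"Node 2 now carries both its original `sp` value and the annotation from the module, because both features refer to the same node numbers."
]
},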
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### A weave\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fabric model + IKEA logistics => Workbench"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Workbench for Cuneiform Tablets\n",
"\n",
"\n",
"\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:33:33.852953Z",
"start_time": "2018-04-14T08:33:32.773215Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found 2095 ideograph linearts\n",
"Found 2724 tablet linearts\n",
"Found 5495 tablet photos\n"
]
},
{
"data": {
"text/markdown": [
"**Documentation:** Uruk IV-III (v1.0) Feature docs Cunei API Text-Fabric API"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"\n",
"This notebook online:\n",
"NBViewer\n",
"GitHub\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"LOC = (\"~/github\", \"Nino-cunei/uruk\", \"Copenhagen2018\")\n",
"from tf.extra.cunei import Cunei # noqa E402\n",
"\n",
"CN = Cunei(*LOC)\n",
"CN.api.makeAvailableIn(globals())"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:00.360919Z",
"start_time": "2018-04-14T08:36:00.356041Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"pNumX = \"P005381\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:01.870439Z",
"start_time": "2018-04-14T08:36:01.860583Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
" "
],
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CN.photo(pNumX, width=\"400\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:05.176041Z",
"start_time": "2018-04-14T08:36:05.164397Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
" "
],
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CN.lineart(pNumX, width=\"300\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:08.367307Z",
"start_time": "2018-04-14T08:36:08.361416Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"&P005381 = MSVO 3, 70\n",
"#atf: lang qpc \n",
"@obverse \n",
"@column 1 \n",
"1.a. 2(N14) , SZE~a SAL TUR3~a NUN~a \n",
"1.b. 3(N19) , |GISZ.TE| \n",
"2. 1(N14) , NAR NUN~a SIG7 \n",
"3. 2(N04)# , PIRIG~b1 SIG7 URI3~a NUN~a \n",
"@column 2 \n",
"1. 3(N04) , |GISZ.TE| GAR |SZU2.((HI+1(N57))+(HI+1(N57)))| GI4~a \n",
"2. , GU7 AZ SI4~f \n",
"@reverse \n",
"@column 1 \n",
"1. 3(N14) , SZE~a \n",
"2. 3(N19) 5(N04) , \n",
"3. , GU7 \n",
"@column 2 \n",
"1. , AZ SI4~f \n"
]
}
],
"source": [
"tabletX = T.nodeFromSection((pNumX,))\n",
"sourceLines = CN.getSource(tabletX)\n",
"print(\"\\n\".join(sourceLines))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:09.525491Z",
"start_time": "2018-04-14T08:36:09.518282Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"case = CN.nodeFromCase((pNumX, \"obverse:1\", \"1a\"))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:10.528763Z",
"start_time": "2018-04-14T08:36:10.516442Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
" "
],
"text/plain": [
""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CN.lineart(CN.getOuterQuads(case), width=50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tablet calculator"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:13.870074Z",
"start_time": "2018-04-14T08:36:13.865487Z"
}
},
"outputs": [],
"source": [
"pNums = \"\"\"\n",
" P005381\n",
" P005447\n",
" P005448\n",
"\"\"\".strip().split()\n",
"\n",
"pNumPat = \"|\".join(pNums)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:14.924425Z",
"start_time": "2018-04-14T08:36:14.919714Z"
}
},
"outputs": [],
"source": [
"shinPP = dict(\n",
" N41=0.2,\n",
" N04=1,\n",
" N19=6,\n",
" N46=60,\n",
" N36=180,\n",
" N49=1800,\n",
")\n",
"\n",
"shinPPPat = \"|\".join(shinPP)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We query for shinPP numerals on the faces of selected tablets.\n",
"The result of the query is a list of tuples `(t, f, s)` consisting of\n",
"a tablet node, a face node and a sign node, which is a shinPP numeral."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:16.680654Z",
"start_time": "2018-04-14T08:36:16.675503Z"
}
},
"outputs": [],
"source": [
"query = f\"\"\"\n",
"tablet catalogId={pNumPat}\n",
" face\n",
" sign type=numeral grapheme={shinPPPat}\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:19.823035Z",
"start_time": "2018-04-14T08:36:19.563384Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"20"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = list(S.search(query))\n",
"len(results)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have found 20 numerals.\n",
"We group the results by tablet and by face."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:21.888749Z",
"start_time": "2018-04-14T08:36:21.881158Z"
}
},
"outputs": [],
"source": [
"numerals = {}\n",
"for (tablet, face, sign) in results:\n",
" numerals.setdefault(tablet, {}).setdefault(face, []).append(sign)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We show the tablets, the shinPP numerals per face, and we add up the numerals per face."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:24.500497Z",
"start_time": "2018-04-14T08:36:24.491743Z"
}
},
"outputs": [],
"source": [
"def dm(x):\n",
" display(Markdown(x))\n",
"\n",
"\n",
"def showResult(pNum, tabletLineart):\n",
" dm(\"---\\n\")\n",
" tablet = T.nodeFromSection((pNum,))\n",
" if tabletLineart:\n",
" display(CN.lineart(tablet, withCaption=\"top\", width=\"200\"))\n",
" faces = numerals[tablet]\n",
" for (face, signs) in faces.items():\n",
" dm(f\"### {F.type.v(face)}\")\n",
" distinctSigns = {}\n",
" for s in signs:\n",
" distinctSigns.setdefault(CN.atfFromSign(s), []).append(s)\n",
" display(CN.lineart(distinctSigns))\n",
" total = 0\n",
" for (signAtf, signs) in distinctSigns.items():\n",
" # note that all signs for the same signAtf have the same grapheme and repeat\n",
" value = 0\n",
" for s in signs:\n",
" value += F.repeat.v(s) * shinPP[F.grapheme.v(s)]\n",
" total += value\n",
" amount = len(signs)\n",
" shinPPval = shinPP[F.grapheme.v(signs[0])]\n",
" repeat = F.repeat.v(signs[0])\n",
" print(f\"{amount} x {signAtf} = {amount} x {repeat} x {shinPPval} = {value}\")\n",
" dm(f\"**total** = **{total}**\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:26.486629Z",
"start_time": "2018-04-14T08:36:26.374836Z"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "01b350c05e0d4ef5b93ff4e5afd35f0b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(Dropdown(description='pNum', options=('P005381', 'P005447', 'P005448'), value='P005381')…"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"widget = interactive(showResult, pNum=sorted(pNums), tabletLineart=False)\n",
"display(widget)"
]
},
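{
"cell_type": "markdown",
"metadata": {},
"source": [
"The arithmetic behind the totals is repeat times unit value, summed per face. The following is a minimal sketch, assuming the numerals are read from a line of ATF text rather than from the `repeat` and `grapheme` features that `showResult` uses. `face_total` is a hypothetical helper, and the sample line is line 2 of the reverse, column 1, of P005381 as printed earlier.\n",
"\n",
"```python\n",
"# Unit values of the shinPP system, as defined earlier in this notebook.\n",
"shinPP = dict(N41=0.2, N04=1, N19=6, N46=60, N36=180, N49=1800)\n",
"\n",
"\n",
"def face_total(atf_line):\n",
"    # Add up repeat x unit value for the shinPP numerals in an ATF line.\n",
"    total = 0\n",
"    for token in atf_line.split():\n",
"        # A numeral looks like 3(N19): a repeat count, then a grapheme.\n",
"        if not token[0].isdigit() or '(' not in token:\n",
"            continue\n",
"        repeat, _, rest = token.partition('(')\n",
"        grapheme = rest.split(')')[0]\n",
"        if grapheme in shinPP:  # graphemes outside shinPP are ignored\n",
"            total += int(repeat) * shinPP[grapheme]\n",
"    return total\n",
"\n",
"\n",
"print(face_total('2. 3(N19) 5(N04) ,'))  # 3 x 6 + 5 x 1 = 23\n",
"```\n",
"\n",
"Numerals such as `3(N14)` contribute nothing here, because `N14` is not among the graphemes listed in `shinPP`."
]
},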
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Workbench for Syriac Linking\n",
"\n",
"![x](images/syrhum.png)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:39.447959Z",
"start_time": "2018-04-14T08:36:39.442297Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"from tf.fabric import Fabric # noqa E402\n",
"\n",
"REPO = \"~/github/etcbc/linksyr\"\n",
"SOURCE = \"syrnt\"\n",
"CORPUS = f\"{REPO}/data/tf/{SOURCE}\""
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:44.925973Z",
"start_time": "2018-04-14T08:36:43.735899Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This is Text-Fabric 3.2.5\n",
"Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api\n",
"Tutorial : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb\n",
"Example data : https://github.com/Dans-labs/text-fabric-data\n",
"\n",
"37 features found and 0 ignored\n"
]
}
],
"source": [
"TF = Fabric(locations=[CORPUS], modules=[\"\"], silent=False)\n",
"api = TF.load(\"\", silent=True)\n",
"allFeatures = TF.explore(silent=True, show=True)\n",
"loadableFeatures = allFeatures[\"nodes\"] + allFeatures[\"edges\"]\n",
"TF.load(loadableFeatures, add=True, silent=True)\n",
"api.makeAvailableIn(globals())"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:45.673645Z",
"start_time": "2018-04-14T08:36:45.664227Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"SYRIACA = os.path.expanduser(f\"{REPO}/data/syriaca\")\n",
"SC_PEOPLE = f\"{SYRIACA}/index_of_persons.csv\"\n",
"SC_PLACES = f\"{SYRIACA}/index_of_places.csv\"\n",
"\n",
"SC_URL = \"http://syriaca.org\"\n",
"SC_PLACE = \"place\"\n",
"SC_PERSON = \"person\"\n",
"\n",
"SC_CONFIG = (\n",
" (SC_PERSON, SC_URL, SC_PEOPLE),\n",
" (SC_PLACE, SC_URL, SC_PLACES),\n",
")\n",
"\n",
"SC_TYPES = tuple(x[0] for x in SC_CONFIG)\n",
"\n",
"SC_FIELDS = (\"trans\", \"syriac\", \"id\")\n",
"\n",
"NA_SYRIAC = {\n",
" \"[Syriac Not Available]\",\n",
" \"[Syriac Not\",\n",
" \"[Syriac\",\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:46.766563Z",
"start_time": "2018-04-14T08:36:46.756877Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"HTML(\n",
" \"\"\"\n",
"\n",
"\"\"\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:48.033529Z",
"start_time": "2018-04-14T08:36:47.989988Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"tables = {}\n",
"irregular = {}\n",
"\n",
"(transF, syriacF, idF) = SC_FIELDS\n",
"\n",
"for (dataType, baseUrl, dataFile) in SC_CONFIG:\n",
" tables[dataType] = {field: {} for field in SC_FIELDS}\n",
" irregular[dataType] = set()\n",
" dest = tables[dataType]\n",
" irreg = irregular[dataType]\n",
" table = dest[idF]\n",
" indexTrans = dest[transF]\n",
" indexSyriac = dest[syriacF]\n",
" with open(dataFile) as fh:\n",
" for (i, line) in enumerate(fh):\n",
" (transV, syriacV, idV) = line.rstrip(\"\\n\").split(\"\\t\")\n",
" prefix = f\"{baseUrl}/{dataType}/\"\n",
" if idV.startswith(prefix):\n",
" idV = idV.replace(prefix, \"\", 1)\n",
" else:\n",
" irreg.add(idV)\n",
" table[idV] = (transV, syriacV)\n",
" indexTrans.setdefault(transV, set()).add(idV)\n",
" if syriacV not in NA_SYRIAC:\n",
" if \"[\" in syriacV:\n",
" print(f'WARNING {dataType} line {i+1}: syriac value \"{syriacV}\"')\n",
" indexSyriac.setdefault(syriacV, set()).add(idV)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-22T21:38:03.656264Z",
"start_time": "2018-03-22T21:38:03.648688Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" persons: 2371 (irregular: 0)\n",
" by syriac : 1503\n",
" by trans : 1964\n",
"\n",
"\n",
" places: 2488 (irregular: 0)\n",
" by syriac : 527\n",
" by trans : 2165\n",
"\n"
]
}
],
"source": [
"for (dataType, data) in tables.items():\n",
" table = data[idF]\n",
" irreg = irregular[dataType]\n",
" print(\n",
" f\"\"\"\n",
"{dataType:>12}s: {len(table):>5} (irregular: {len(irreg):>4})\n",
"{\"by syriac\":>12} : {len(data[syriacF]):>5}\n",
"{\"by trans\":>12} : {len(data[transF]):>5}\n",
"\"\"\"\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:50.917791Z",
"start_time": "2018-04-14T08:36:50.905348Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"hits = {dataType: {} for dataType in SC_TYPES}\n",
"\n",
"for lx in F.otype.s(\"lexeme\"):\n",
" lex = F.lexeme.v(lx)\n",
" for dataType in SC_TYPES:\n",
" idV = tables[dataType][syriacF].get(lex, None)\n",
" if idV is not None:\n",
" hits[dataType][lx] = idV"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:51.734061Z",
"start_time": "2018-04-14T08:36:51.725256Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" persons: 98 hits\n",
" places: 37 hits\n"
]
}
],
"source": [
"for (dataType, theseHits) in hits.items():\n",
" print(f\"{dataType:>12}s: {len(theseHits):>5} hits\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We show the hits by picking the first occurrence of each lexeme and showing it in context."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:36:54.665574Z",
"start_time": "2018-04-14T08:36:54.645113Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/markdown": [
"### persons\n",
"lexeme | linked | n-occs | passage | verse text\n",
"--- | --- | --- | --- | ---\n",
"ܐܒܐ | 1094 2582 308 | 9 | Matthew 2:22 | ܟܕ ܕܝܢ ܫܡܥ ܕܐܪܟܠܐܘܣ ܗܘܐ ܡܠܟܐ ܒܝܗܘܕ ܚܠܦ ܗܪܘܕܣ ܐܒܘܗܝ ܕܚܠ ܕܢܐܙܠ ܠܬܡܢ ܘܐܬܚܙܝ ܠܗ ܒܚܠܡܐ ܕܢܐܙܠ ܠܐܬܪܐ ܕܓܠܝܠܐ \n",
"ܐܒܪܗܡ | 1108 1109 1110 1546 1547 1548 1549 1551 1552 1553 1554 2202 964 | 2 | Matthew 1:1 | ܟܬܒܐ ܕܝܠܝܕܘܬܗ ܕܝܫܘܥ ܡܫܝܚܐ ܒܪܗ ܕܕܘܝܕ ܒܪܗ ܕܐܒܪܗܡ \n",
"ܐܕܝ | 1117 1118 2203 | 2 | Luke 3:28 | ܒܪ ܡܠܟܝ ܒܪ ܐܕܝ ܒܪ ܩܘܣܡ ܒܪ ܐܠܡܘܕܕ ܒܪ ܥܝܪ \n",
"ܐܕܡ | 1560 | 208 | Luke 3:38 | ܒܪ ܐܢܘܫ ܒܪ ܫܝܬ ܒܪ ܐܕܡ ܕܡܢ ܐܠܗܐ \n",
"ܐܗܪܘܢ | 1012 1092 1533 1534 | 3 | Luke 1:5 | ܗܘܐ ܒܝܘܡܬܗ ܕܗܪܘܕܣ ܡܠܟܐ ܕܝܗܘܕܐ ܟܗܢܐ ܚܕ ܕܫܡܗ ܗܘܐ ܙܟܪܝܐ ܡܢ ܬܫܡܫܬܐ ܕܒܝܬ ܐܒܝܐ ܘܐܢܬܬܗ ܡܢ ܒܢܬܗ ܕܐܗܪܘܢ ܫܡܗ ܗܘܐ ܐܠܝܫܒܥ \n",
"ܐܘܒܘܠܘܣ | 3028 | 1 | 2_Timothy 4:21 | ܢܬܒܛܠ ܠܟ ܕܩܕܡ ܣܬܘܐ ܬܐܬܐ ܫܐܠ ܒܫܠܡܟ ܐܘܒܘܠܘܣ ܘܦܘܕܣ ܘܠܝܢܘܣ ܘܩܠܘܕܝܐ ܘܐܚܐ ܟܠܗܘܢ \n",
"ܐܚܐ | 1122 1123 1740 | 3 | Matthew 1:2 | ܐܒܪܗܡ ܐܘܠܕ ܠܐܝܣܚܩ ܐܝܣܚܩ ܐܘܠܕ ܠܝܥܩܘܒ ܝܥܩܘܒ ܐܘܠܕ ܠܝܗܘܕܐ ܘܠܐܚܘܗܝ \n",
"ܐܝܣܚܩ | 1788 1789 1790 1791 1792 2578 | 3 | Matthew 1:2 | ܐܒܪܗܡ ܐܘܠܕ ܠܐܝܣܚܩ ܐܝܣܚܩ ܐܘܠܕ ܠܝܥܩܘܒ ܝܥܩܘܒ ܐܘܠܕ ܠܝܗܘܕܐ ܘܠܐܚܘܗܝ \n",
"ܐܠܝܐ | 1698 1699 1700 1703 1704 1705 2541 3145 945 | 1 | Matthew 2:18 | ܩܠܐ ܐܫܬܡܥ ܒܪܡܬܐ ܒܟܝܐ ܘܐܠܝܐ ܣܓܝܐܐ ܪܚܝܠ ܒܟܝܐ ܥܠ ܒܢܝܗ ܘܠܐ ܨܒܝܐ ܠܡܬܒܝܐܘ ܡܛܠ ܕܠܐ ܐܝܬܝܗܘܢ \n",
"ܐܠܟܣܢܕܪܘܣ | 1574 887 | 1 | Mark 15:21 | ܘܫܚܪܘ ܚܕ ܕܥܒܪ ܗܘܐ ܫܡܥܘܢ ܩܘܪܝܢܝܐ ܕܐܬܐ ܗܘܐ ܡܢ ܩܪܝܬܐ ܐܒܘܗܝ ܕܐܠܟܣܢܕܪܘܣ ܘܕܪܘܦܘܣ ܕܢܫܩܘܠ ܙܩܝܦܗ \n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"### places\n",
"lexeme | linked | n-occs | passage | verse text\n",
"--- | --- | --- | --- | ---\n",
"ܐܘܪܫܠܡ | 104 | 2 | Matthew 2:1 | ܟܕ ܕܝܢ ܐܬܝܠܕ ܝܫܘܥ ܒܒܝܬ-ܠܚܡ ܕܝܗܘܕܐ ܒܝܘܡܝ ܗܪܘܕܣ ܡܠܟܐ ܐܬܘ ܡܓܘܫܐ ܡܢ ܡܕܢܚܐ ܠܐܘܪܫܠܡ \n",
"ܐܠܟܣܢܕܪܝܐ | 572 | 2 | Acts 6:9 | ܘܩܡܘ ܗܘܘ ܐܢܫܐ ܡܢ ܟܢܘܫܬܐ ܕܡܬܩܪܝܐ ܕܠܝܒܪܛܝܢܘ ܘܩܘܪܝܢܝܐ ܘܐܠܟܣܢܕܪܝܐ ܘܕܡܢ ܩܝܠܝܩܝܐ ܘܡܢ ܐܣܝܐ ܘܕܪܫܝܢ ܗܘܘ ܥܡ ܐܣܛܦܢܘܣ \n",
"ܐܢܛܝܘܟܝܐ | 10 995 | 44 | Acts 6:5 | ܘܫܦܪܬ ܗܕܐ ܡܠܬܐ ܩܕܡ ܟܠܗ ܥܡܐ ܘܓܒܘ ܠܐܣܛܦܢܘܣ ܓܒܪܐ ܕܡܠܐ ܗܘܐ ܗܝܡܢܘܬܐ ܘܪܘܚܐ ܕܩܘܕܫܐ ܘܠܦܝܠܝܦܘܣ ܘܠܦܪܟܪܘܣ ܘܠܢܝܩܢܘܪ ܘܠܛܝܡܘܢ ܘܠܦܪܡܢܐ ܘܠܢܝܩܠܐܘܣ ܓܝܘܪܐ ܐܢܛܝܘܟܝܐ \n",
"ܐܣܦܣ | 288 | 5 | Romans 3:13 | ܩܒܪܐ ܦܬܝܚܐ ܓܓܪܬܗܘܢ ܘܠܫܢܝܗܘܢ ܢܟܘܠܬܢܝܢ ܘܚܡܬܐ ܕܐܣܦܣ ܬܚܝܬ ܣܦܘܬܗܘܢ \n",
"ܐܦܣܘܣ | 623 | 69 | Acts 18:19 | ܘܡܛܝܘ ܠܐܦܣܘܣ ܘܥܠ ܦܘܠܘܣ ܠܟܢܘܫܬܐ ܘܡܡܠܠ ܗܘܐ ܥܡ ܝܗܘܕܝܐ \n",
"ܐܪܟ | 515 | 1 | Matthew 23:5 | ܘܟܠܗܘܢ ܥܒܕܝܗܘܢ ܥܒܕܝܢ ܕܢܬܚܙܘܢ ܠܒܢܝ ܐܢܫܐ ܡܦܬܝܢ ܓܝܪ ܬܦܠܝܗܘܢ ܘܡܘܪܟܝܢ ܬܟܠܬܐ ܕܡܪܛܘܛܝܗܘܢ \n",
"ܓܐܝܘܣ | 1494 | 1 | Acts 19:29 | ܘܐܫܬܓܫܬ ܟܠܗ ܡܕܝܢܬܐ ܘܪܗܛܘ ܐܟܚܕܐ ܘܐܙܠܘ ܠܬܐܛܪܘܢ ܘܚܛܦܘ ܐܘܒܠܘ ܥܡܗܘܢ ܠܓܐܝܘܣ ܘܠܐܪܣܛܪܟܘܣ ܓܒܪܐ ܡܩܕܘܢܝܐ ܒܢܝ ܠܘܝܬܗ ܕܦܘܠܘܣ \n",
"ܕܪܐ | 67 | 32 | Matthew 21:44 | ܘܡܢ ܕܢܦܠ ܥܠ ܟܐܦܐ ܗܕܐ ܢܬܪܥܥ ܘܟܠ ܡܢ ܕܗܝ ܬܦܠ ܥܠܘܗܝ ܬܕܪܝܘܗܝ \n",
"ܕܪܡܣܘܩ | 66 | 24 | Acts 9:2 | ܘܫܐܠ ܠܗ ܐܓܪܬܐ ܡܢ ܪܒ ܟܗܢܐ ܕܢܬܠ ܠܗ ܠܕܪܡܣܘܩ ܠܟܢܘܫܬܐ ܕܐܢ ܗܘ ܕܢܫܟܚ ܕܪܕܝܢ ܒܗܕܐ ܐܘܪܚܐ ܓܒܪܐ ܐܘ ܢܫܐ ܢܐܣܘܪ ܢܝܬܐ ܐܢܘܢ ܠܐܘܪܫܠܡ \n",
"ܚܘܪܐ | 1456 | 4 | Matthew 5:36 | ܐܦܠܐ ܒܪܫܟ ܬܐܡܐ ܕܠܐ ܡܫܟܚ ܐܢܬ ܠܡܥܒܕ ܒܗ ܡܢܬܐ ܚܕܐ ܕܣܥܪܐ ܐܘܟܡܬܐ ܐܘ ܚܘܪܬܐ \n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"for (dataType, theseHits) in hits.items():\n",
" markdown = f\"\"\"### {dataType}s\n",
"lexeme | linked | n-occs | passage | verse text\n",
"--- | --- | --- | --- | ---\n",
"\"\"\"\n",
" for (lx, linked) in sorted(\n",
" theseHits.items(),\n",
" key=lambda x: F.lexeme.v(x[0]),\n",
" )[0:10]:\n",
" lex = F.lexeme.v(lx)\n",
" ids = \" \".join(sorted(linked))\n",
" occs = L.d(lx, otype=\"word\")\n",
" passage = \"{} {}:{}\".format(*T.sectionFromNode(occs[0]))\n",
" verse = L.u(occs[0], otype=\"verse\")[0]\n",
" text = T.text(L.d(verse, otype=\"word\"))\n",
" markdown += (\n",
" f'{lex} | {ids} | {len(occs)} | {passage} |'\n",
" f' {text}\\n'\n",
" )\n",
" display(Markdown(markdown))"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:37:07.959937Z",
"start_time": "2018-04-14T08:37:07.950869Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"syriacaResolve = os.path.expanduser(f\"{REPO}/data/user/syriacaSyrNT.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:37:08.732700Z",
"start_time": "2018-04-14T08:37:08.715449Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/markdown": [
"### persons\n",
"lexeme | trans | url | applicable\n",
"--- | --- | --- | ---\n",
"ܐܒܐ | Aba of Nineveh | http://syriaca.org/person/1094 | no\n",
"ܐܒܐ | Abba | http://syriaca.org/person/2582 | no\n",
"ܐܒܐ | Aba | http://syriaca.org/person/308 | no\n",
"ܐܒܪܗܡ | Abraham | http://syriaca.org/person/964 | no\n",
"ܐܒܪܗܡ | Abraham | http://syriaca.org/person/1548 | no\n",
"ܐܒܪܗܡ | Abraham of Harran | http://syriaca.org/person/1549 | no\n",
"ܐܒܪܗܡ | Abraham | http://syriaca.org/person/1546 | no\n",
"ܐܒܪܗܡ | Abraham of the High Mountain | http://syriaca.org/person/1109 | no\n",
"ܐܒܪܗܡ | Abraham | http://syriaca.org/person/1110 | no\n",
"ܐܒܪܗܡ | Abraham | http://syriaca.org/person/1547 | no\n",
"ܐܒܪܗܡ | Abraham II of Adiabene | http://syriaca.org/person/1552 | no\n",
"ܐܒܪܗܡ | Abraham of Adiabene | http://syriaca.org/person/1551 | no\n",
"ܐܒܪܗܡ | Abraham | http://syriaca.org/person/2202 | no\n",
"ܐܒܪܗܡ | Abraham the Egyptian | http://syriaca.org/person/1553 | no\n",
"ܐܒܪܗܡ | Abraham the Priest | http://syriaca.org/person/1554 | no\n",
"ܐܒܪܗܡ | Abraham, bishop of Arbela | http://syriaca.org/person/1108 | no\n",
"ܐܕܝ | Addai | http://syriaca.org/person/1118 | no\n",
"ܐܕܝ | Addai | http://syriaca.org/person/1117 | no\n",
"ܐܕܝ | Addai | http://syriaca.org/person/2203 | no\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"### places\n",
"lexeme | trans | url | applicable\n",
"--- | --- | --- | ---\n",
"ܐܘܪܫܠܡ | Jerusalem (settlement) | http://syriaca.org/place/104 | no\n",
"ܐܠܟܣܢܕܪܝܐ | Alexandria (settlement) | http://syriaca.org/place/572 | no\n",
"ܐܢܛܝܘܟܝܐ | Antioch (settlement) | http://syriaca.org/place/10 | no\n",
"ܐܢܛܝܘܟܝܐ | Antioch (region) | http://syriaca.org/place/995 | no\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fieldNames = (\"lexeme\", \"trans\", \"url\", \"applicable\")\n",
"\n",
"fh = open(syriacaResolve, \"w\")\n",
"for (dataType, theseHits) in hits.items():\n",
" tsv = \"\\t\".join(fieldNames) + \"\\n\"\n",
" markdown = f\"\"\"### {dataType}s\n",
"{\" | \".join(fieldNames)}\n",
"--- | --- | --- | ---\n",
"\"\"\"\n",
" table = tables[dataType]\n",
" data = table[idF]\n",
" for (lx, linked) in sorted(\n",
" theseHits.items(),\n",
" key=lambda x: F.lexeme.v(x[0]),\n",
" )[0:3]:\n",
" lex = F.lexeme.v(lx)\n",
" for lid in linked:\n",
" trans = data[lid][0]\n",
" url = f\"{SC_URL}/{dataType}/{lid}\"\n",
" markdown += f'{lex} | {trans} | {url} | no\\n'\n",
" tsv += f\"{lex}\\t{trans}\\t{url}\\tno\\n\"\n",
"\n",
" fh.write(tsv)\n",
" display(Markdown(markdown))\n",
"fh.close()"
]
},
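{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A match on spelling yields a set of candidate ids, not an identification, which is presumably why every row starts out with `applicable` set to `no`. A minimal sketch of that lookup, using a few ids from the tables in this notebook; `by_syriac` and `candidates` are hypothetical names for this illustration only.\n",
"\n",
"```python\n",
"# Toy index from Syriac spelling to candidate syriaca.org ids,\n",
"# filled with a few rows shown in this notebook.\n",
"by_syriac = {\n",
"    'person': {'ܐܕܝ': {'1117', '1118', '2203'}},\n",
"    'place': {'ܐܢܛܝܘܟܝܐ': {'10', '995'}},\n",
"}\n",
"\n",
"\n",
"def candidates(lexeme):\n",
"    # Collect the sorted candidate ids per data type that knows the lexeme.\n",
"    return {\n",
"        data_type: sorted(index[lexeme])\n",
"        for data_type, index in by_syriac.items()\n",
"        if lexeme in index\n",
"    }\n",
"\n",
"\n",
"print(candidates('ܐܢܛܝܘܟܝܐ'))  # {'place': ['10', '995']}\n",
"```\n",
"\n",
"Each candidate still has to be confirmed or rejected for the text at hand, for example by editing the `applicable` column of the generated file."
]
},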
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Workbench (BHSA)\n",
"\n",
"![x](images/bibhum.png)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:37:35.023834Z",
"start_time": "2018-04-14T08:37:35.010873Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import sys # noqa F401\n",
"import os # noqa F401\n",
"import collections # noqa F401\n",
"from utils import structure, layout # noqa F401\n",
"from IPython.display import display, HTML, Markdown # noqa F401\n",
"from ipywidgets import interact, interactive, fixed, interact_manual # noqa F401\n",
"import ipywidgets as widgets # noqa F401\n",
"\n",
"from tf.Fabric import Fabric # noqa F401\n",
"from tf.extra.bhsa import Bhsa # noqa F401"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:42:06.678457Z",
"start_time": "2018-04-14T08:42:06.672729Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"VERSION = \"2017\"\n",
"BASE = \"~/github\"\n",
"ETCBC = f\"{BASE}/etcbc\"\n",
"BHSA = f\"bhsa/tf/{VERSION}\"\n",
"TREES = f\"lingo/trees/tf/{VERSION}\" # derived wefts\n",
"OSM = f\"bridging/tf/{VERSION}\" # wefts from the OSM crafts shop\n",
"PHONO = f\"phono/tf/{VERSION}\" # derived wefts\n",
"PARALLELS = f\"parallels/tf/{VERSION}\" # derived wefts"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:42:38.490848Z",
"start_time": "2018-04-14T08:42:32.458968Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This is Text-Fabric 3.2.5\n",
"Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api\n",
"Tutorial : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb\n",
"Example data : https://github.com/Dans-labs/text-fabric-data\n",
"\n",
"124 features found and 0 ignored\n"
]
},
{
"data": {
"text/markdown": [
"**Documentation:** BHSA Feature docs BHSA API Text-Fabric API"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"\n",
"This notebook online:\n",
"NBViewer\n",
"GitHub\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"TF = Fabric(locations=[ETCBC], modules=[BHSA, PHONO, PARALLELS, TREES, OSM])\n",
"api = TF.load(\n",
" \"\"\"\n",
" g_word_utf8 g_cons_utf8\n",
" voc_lex_utf8 gloss\n",
" phono crossref tree\n",
" osm\n",
"\"\"\",\n",
" silent=True,\n",
")\n",
"api.makeAvailableIn(globals())\n",
"\n",
"B = Bhsa(api, \"Copenhagen2018\", version=\"2017\")"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:43:07.260967Z",
"start_time": "2018-04-14T08:43:07.252325Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"verse = T.nodeFromSection((\"Genesis\", 1, 7))"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:43:13.522499Z",
"start_time": "2018-04-14T08:43:13.507478Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
וַ
\n",
"
\n",
"
and
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
יַּ֣עַשׂ
\n",
"
\n",
"
make
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
אֱלֹהִים֮
\n",
"
\n",
"
god(s)
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
אֶת־
\n",
"
\n",
"
<object marker>
\n",
"\n",
"
\n",
"\n",
"
\n",
"
הָ
\n",
"
\n",
"
the
\n",
"\n",
"
\n",
"\n",
"
\n",
"
רָקִיעַ֒
\n",
"
\n",
"
firmament
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
וַ
\n",
"
\n",
"
and
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
יַּבְדֵּ֗ל
\n",
"
\n",
"
separate
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
בֵּ֤ין
\n",
"
\n",
"
interval
\n",
"\n",
"
\n",
"\n",
"
\n",
"
הַ
\n",
"
\n",
"
the
\n",
"\n",
"
\n",
"\n",
"
\n",
"
מַּ֨יִם֙
\n",
"
\n",
"
water
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
אֲשֶׁר֙
\n",
"
\n",
"
<relative>
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
מִ
\n",
"
\n",
"
from
\n",
"\n",
"
\n",
"\n",
"
\n",
"
תַּ֣חַת
\n",
"
\n",
"
under part
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
לָ
\n",
"
\n",
"
to
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
the
\n",
"\n",
"
\n",
"\n",
"
\n",
"
רָקִ֔יעַ
\n",
"
\n",
"
firmament
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
וּ
\n",
"
\n",
"
and
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
בֵ֣ין
\n",
"
\n",
"
interval
\n",
"\n",
"
\n",
"\n",
"
\n",
"
הַ
\n",
"
\n",
"
the
\n",
"\n",
"
\n",
"\n",
"
\n",
"
מַּ֔יִם
\n",
"
\n",
"
water
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
אֲשֶׁ֖ר
\n",
"
\n",
"
<relative>
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
מֵ
\n",
"
\n",
"
from
\n",
"\n",
"
\n",
"\n",
"
\n",
"
עַ֣ל
\n",
"
\n",
"
upon
\n",
"\n",
"
\n",
"\n",
"
\n",
"
לָ
\n",
"
\n",
"
to
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
the
\n",
"\n",
"
\n",
"\n",
"
\n",
"
רָקִ֑יעַ
\n",
"
\n",
"
firmament
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
וַֽ
\n",
"
\n",
"
and
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
יְהִי־
\n",
"
\n",
"
be
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"
\n",
"
כֵֽן׃
\n",
"
\n",
"
thus
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"\n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"B.pretty(verse)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:43:28.416927Z",
"start_time": "2018-04-14T08:43:28.403738Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"[B.pretty(clause) display: וַ = and, יַּבְדֵּ֗ל = separate, בֵּ֤ין = interval, הַ = the, מַּ֨יִם֙ = water, וּ = and, בֵ֣ין = interval, הַ = the, מַּ֔יִם = water]\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"clause = 427572\n",
"B.pretty(clause)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:43:31.784153Z",
"start_time": "2018-04-14T08:43:31.772987Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"Genesis 1:7"
],
"text/plain": [
""
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"HTML(B.shbLink(clause))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Queries\n",
"\n",
"![ku](images/stephen-ku.png)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:43:39.220232Z",
"start_time": "2018-04-14T08:43:39.210182Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"ellipQuery = \"\"\"\n",
"sentence\n",
"  c1:clause\n",
"    phrase function=Pred\n",
"      word pdp=verb\n",
"  c2:clause\n",
"    phrase function=Pred\n",
"  c3:clause typ=Ellp\n",
"    phrase function=Objc\n",
"      word pdp=subs|nmpr|prps|prde|prin\n",
"  c1 << c2\n",
"  c2 << c3\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:43:42.844145Z",
"start_time": "2018-04-14T08:43:40.091661Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"1410"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = B.search(ellipQuery)\n",
"len(results)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:43:44.866625Z",
"start_time": "2018-04-14T08:43:44.835670Z"
}
},
"outputs": [],
"source": [
"def f(n):\n",
"    B.show(results, n, n + 1, withNodes=True)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:43:50.634601Z",
"start_time": "2018-04-14T08:43:50.534911Z"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a6d7c70a0e734b5e906cafdbd6d08cfa",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=0, description='n', max=1409), Output()), _dom_classes=('widget-interact…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"interact(f, n=widgets.IntSlider(min=0, max=len(results) - 1, step=1, value=0))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# *You* can do this!\n",
"\n",
"because:\n",
"\n",
"* the text model works with proper *logic*:\n",
"  * graph = nodes + edges + feature annotations\n",
"  * very similar to the model of Emdros (MQL)\n",
"* the data packaging is for efficient *logistics*\n",
"* but do take a beginner's course in **Python**"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Trees\n",
"\n",
"In 2013/2014 we\n",
"[extracted](https://github.com/ETCBC/lingo/blob/master/trees/trees.ipynb)\n",
"tree structures from the BHSA data.\n",
"\n",
"Every sentence has a tree associated with it, like this:\n",
"\n",
"```\n",
"(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))\n",
"```\n",
"The numbers refer to the words in the sentence."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Trees as feature\n",
"\n",
"The trees are available in a feature `tree`, defined for sentences.\n",
"\n",
"```\n",
"@node\n",
"@converter=Dirk Roorda\n",
"@convertor=trees.ipynb\n",
"@coreData=BHSA\n",
"@coreVersion=2017\n",
"@description=penn treebank represententation for sentences\n",
"@url=https://github.com/etcbc/lingo/trees/trees.ipynb\n",
"@valueType=str\n",
"@writtenBy=Text-Fabric\n",
"@dateWritten=2018-01-21T18:53:06Z\n",
"\n",
"1172209\t(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))\n",
"(S(C(CP(cj 0))(NP(dt 1)(n 2))(VP(vb 3))(NP(U(n 4))(cj 5)(U(n 6)))))\n",
"(S(C(CP(cj 0))(NP(n 1))(PP(pp 2)(U(n 3))(U(n 4)))))\n",
"(S(C(CP(cj 0))(NP(U(n 1))(U(n 2)))(VP(vb 3))(PP(pp 4)(U(n 5))(U(dt 6)(n 7)))))\n",
"\n",
"... and 60,000 more lines\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Trees are nice. But this output does **not** look nice."
]
},
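{
"cell_type": "markdown",
"metadata": {},
"source": [
"How such a feature file maps lines to nodes can be sketched in a few lines of Python. This is only an illustration, not Text-Fabric's own loader: `read_node_feature` is a made-up name, the sample lines are toy data, and the sketch assumes that a data line is either a node number, a tab and a value, or a bare value that belongs to the node after the previous one. Node ranges, escapes and blank data lines are left out.\n",
"\n",
"```python\n",
"def read_node_feature(lines):\n",
"    # hypothetical helper: returns a dict {node: value}\n",
"    data = {}\n",
"    node = 0\n",
"    body = False\n",
"    for raw in lines:\n",
"        line = raw.rstrip('\\n')\n",
"        if not body:\n",
"            # metadata lines start with '@'; the first blank line ends them\n",
"            if line == '':\n",
"                body = True\n",
"            continue\n",
"        if '\\t' in line:\n",
"            # explicit form: node number, tab, value\n",
"            spec, value = line.split('\\t', 1)\n",
"            node = int(spec)\n",
"        else:\n",
"            # implicit form: a bare value for the next node\n",
"            value = line\n",
"            node += 1\n",
"        data[node] = value\n",
"    return data\n",
"\n",
"\n",
"# toy input in the shape of the listing above\n",
"sample = [\n",
"    '@node',\n",
"    '@valueType=str',\n",
"    '',\n",
"    '1172209\\t(S(C(VP(vb 0))))',\n",
"    '(S(C(CP(cj 0))(VP(vb 1))))',\n",
"]\n",
"trees = read_node_feature(sample)\n",
"print(trees[1172210])\n",
"```\n",
"\n",
"Under these assumptions one explicit node number is enough: every following line belongs to the next sentence node."
]
},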
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Display\n",
"\n",
"We want\n",
"\n",
"* multiline view\n",
"* see the words\n",
"* phonetically\n",
"* with gloss\n",
"* and with **Open Scriptures Morphology** tag!"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:44:09.565512Z",
"start_time": "2018-04-14T08:44:09.559819Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Job 3:16 - first word = 336986\n",
"\n",
"tree =\n",
"(S(C(Ccoor(CP(cj 0))(PP(pp 1)(U(n 2))(U(vb 3)))(NegP(ng 4))(VP(vb 5)))(Ccoor(PP(pp 6)(n 7)(Cattr(NegP(ng 8))(VP(vb 9))(NP(n 10)))))))\n"
]
}
],
"source": [
"passage = (\"Job\", 3, 16)\n",
"passageStr = \"{} {}:{}\".format(*passage)\n",
"verse = T.nodeFromSection(passage)\n",
"sentence = L.d(verse, otype=\"sentence\")[0]\n",
"firstSlot = L.d(sentence, otype=\"word\")[0]\n",
"stringTree = F.tree.v(sentence)\n",
"print(f\"{passageStr} - first word = {firstSlot}\\n\\ntree =\\n{stringTree}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Parsing\n",
"\n",
"Parse it into a structure:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:44:11.848250Z",
"start_time": "2018-04-14T08:44:11.837350Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['S',\n",
" ['C',\n",
" ['Ccoor',\n",
" ['CP', [('cj', 0)]],\n",
" ['PP', [('pp', 1)], ['U', [('n', 2)]], ['U', [('vb', 3)]]],\n",
" ['NegP', [('ng', 4)]],\n",
" ['VP', [('vb', 5)]]],\n",
" ['Ccoor',\n",
" ['PP',\n",
" [('pp', 6)],\n",
" [('n', 7)],\n",
" ['Cattr',\n",
" ['NegP', [('ng', 8)]],\n",
" ['VP', [('vb', 9)]],\n",
" ['NP', [('n', 10)]]]]]]]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tree = structure(stringTree)\n",
"tree"
]
},
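{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `structure()` function is not shown on these slides. A minimal sketch of how such a parser could work, assuming that every leaf has the form `(tag number)` and every other node is a label followed by child nodes; `parse_tree` is a hypothetical stand-in, not the original code.\n",
"\n",
"```python\n",
"def parse_tree(text):\n",
"    # split the string into '(' , ')' and bare words\n",
"    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()\n",
"    pos = 0\n",
"\n",
"    def node():\n",
"        nonlocal pos\n",
"        pos += 1  # skip '('\n",
"        label = tokens[pos]\n",
"        pos += 1\n",
"        if tokens[pos] == '(':\n",
"            # inner node: a label followed by one or more child nodes\n",
"            result = [label]\n",
"            while tokens[pos] == '(':\n",
"                result.append(node())\n",
"        else:\n",
"            # leaf such as (vb 3): a tag plus a relative word number\n",
"            result = [(label, int(tokens[pos]))]\n",
"            pos += 1\n",
"        pos += 1  # skip ')'\n",
"        return result\n",
"\n",
"    return node()\n",
"\n",
"\n",
"parsed = parse_tree('(S(C(CP(cj 0))(VP(vb 1))))')\n",
"print(parsed)\n",
"```\n",
"\n",
"The result has the same shape as the structure shown above: nested lists, with each word as a one-element list holding a `(tag, number)` pair."
]
},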
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can display it in a friendlier way:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:44:13.740602Z",
"start_time": "2018-04-14T08:44:13.736048Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"  S\n",
"    C\n",
"      Ccoor\n",
"        CP\n",
"          cj 336986\n",
"        PP\n",
"          pp 336987\n",
"          U\n",
"            n 336988\n",
"          U\n",
"            vb 336989\n",
"        NegP\n",
"          ng 336990\n",
"        VP\n",
"          vb 336991\n",
"      Ccoor\n",
"        PP\n",
"          pp 336992\n",
"          n 336993\n",
"          Cattr\n",
"            NegP\n",
"              ng 336994\n",
"            VP\n",
"              vb 336995\n",
"            NP\n",
"              n 336996\n"
]
}
],
"source": [
"print(layout(tree, firstSlot, str))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Note that `layout()` has replaced the relative word numbers in the sentence with absolute slot numbers in the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Weaving the wefts ...\n",
"\n",
"All the wefts are there; we have to weave them around each warp."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:44:17.703769Z",
"start_time": "2018-04-14T08:44:17.697447Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"def osmPhonoGloss(n):\n",
"    lexNode = L.u(n, otype=\"lex\")[0]\n",
"    return '{{{}}} \"{}\" [{}] = {}'.format(\n",
"        F.osm.v(n),\n",
"        F.g_word_utf8.v(n),\n",
"        F.phono.v(n),\n",
"        F.gloss.v(lexNode),  # gloss is a feature on lexemes, not words\n",
"        # F.voc_lex_utf8.v(lexNode),\n",
"    )"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"## ... into a weave"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:44:19.728752Z",
"start_time": "2018-04-14T08:44:19.723849Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1 S\n",
" 2   C\n",
" 3     Ccoor\n",
" 4       CP\n",
" 5         cj {HC} \"אֹ֚ו\" [ˈʔô] = or\n",
" 4       PP\n",
" 5         pp {HR} \"כְ\" [ḵᵊ] = as\n",
" 5         U\n",
" 6           n {HNcmsa} \"נֵ֣פֶל\" [nˈēfel] = miscarriage\n",
" 5         U\n",
" 6           vb {HVqsmsa} \"טָ֭מוּן\" [ˈṭāmûn] = hide\n",
" 4       NegP\n",
" 5         ng {HTn} \"לֹ֣א\" [lˈō] = not\n",
" 4       VP\n",
" 5         vb {HVqi1cs} \"אֶהְיֶ֑ה\" [ʔehyˈeh] = be\n",
" 3     Ccoor\n",
" 4       PP\n",
" 5         pp {HR} \"כְּ֝\" [ˈkᵊ] = as\n",
" 5         n {HNcmpa} \"עֹלְלִ֗ים\" [ʕōlᵊlˈîm] = child\n",
" 5         Cattr\n",
" 6           NegP\n",
" 7             ng {HTn} \"לֹא\" [lō-] = not\n",
" 6           VP\n",
" 7             vb {HVqp3cp} \"רָ֥אוּ\" [rˌāʔû] = see\n",
" 6           NP\n",
" 7             n {HNcbsa} \"אֹֽור\" [ʔˈôr] = light\n"
]
}
],
"source": [
"print(layout(tree, firstSlot, osmPhonoGloss, withLevel=True))"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:44:20.749230Z",
"start_time": "2018-04-14T08:44:20.744085Z"
}
},
"outputs": [],
"source": [
"def showTree(s):\n",
"    t = F.tree.v(s)\n",
"    tree = structure(t)\n",
"    firstSlot = L.d(s, otype=\"word\")[0]\n",
"    label = \"{} {}:{}\".format(*T.sectionFromNode(firstSlot))\n",
"    print(label)\n",
"    print(layout(tree, firstSlot, osmPhonoGloss, withLevel=True))\n",
"    return 0\n",
"\n",
"\n",
"sentenceInfo = [c for c in C.levels.data if c[0] == \"sentence\"][0]\n",
"minSentence = sentenceInfo[2]\n",
"maxSentence = sentenceInfo[3]"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:44:22.047279Z",
"start_time": "2018-04-14T08:44:22.004535Z"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "fa907cf1432f43e097a6fdb7888eda8f",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=1172209, description='s', max=1235919, min=1172209), Output()), _dom_cla…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"interact(\n",
"    showTree,\n",
"    s=widgets.IntSlider(\n",
"        min=minSentence,\n",
"        max=maxSentence,\n",
"        step=1,\n",
"        value=minSentence,\n",
"    ),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# No leaking of concerns\n",
"\n",
"* The TREES module knows nothing of OS morphology\n",
"* OS morphology is not aware of TREES\n",
"* *thank goodness*\n",
"* But they are woven cosily together in one display\n",
"\n",
"![a](images/amos.png)\n",
"\n",
"On Perseus, via [Scaife](https://scaife.perseus.org/library/urn:cts:ancJewLit:hebBible/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Carried away by tree structures\n",
"\n",
"The raw strings are handy for structure analysis, in a way the woven trees cannot be.\n",
"\n",
"Let us see how many distinct tree structures we've got.\n",
"\n",
"**liberate yourself from micro-management**"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:44:25.408319Z",
"start_time": "2018-04-14T08:44:25.284509Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"28096 distinct trees of 63711 in total\n"
]
}
],
"source": [
"treeDistribution = F.tree.freqList()\n",
"\n",
"distinct = len(treeDistribution)\n",
"total = sum(x[1] for x in treeDistribution)\n",
"\n",
"print(f\"{distinct} distinct trees of {total} in total\")"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:44:26.294852Z",
"start_time": "2018-04-14T08:44:26.289010Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3772 x (S(C(CP(cj 0))(VP(vb 1))))\n",
"1238 x (S(C(VP(vb 0))))\n",
"1173 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2))))\n",
" 857 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(n 3))))\n",
" 749 x (S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))\n",
" 577 x (S(C(CP(cj 0))(VP(vb 1))(PrNP(n-pr 2))))\n",
" 568 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(dt 3)(n 4))))\n",
" 554 x (S(C(VP(vb 0))(NP(n 1))))\n",
" 441 x (S(C(CP(cj 0))(NegP(ng 1))(VP(vb 2))))\n",
" 406 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(n-pr 3))))\n"
]
}
],
"source": [
"for (tree, amount) in treeDistribution[0:10]:\n",
"    print(f\"{amount:>4} x {tree}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"I'm intrigued by the most frequent tree structure.\n",
"\n",
"Which verbs occur in such a sentence? Let's find out."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:44:28.690556Z",
"start_time": "2018-04-14T08:44:28.575889Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"501 lexemes found\n"
]
}
],
"source": [
"lexemes = collections.Counter()\n",
"short = treeDistribution[0][0]\n",
"for s in F.otype.s(\"sentence\"):\n",
"    if F.tree.v(s) == short:\n",
"        # in (S(C(CP(cj 0))(VP(vb 1)))) the verb is word 1 of the sentence\n",
"        verb = L.d(s, otype=\"word\")[1]\n",
"        lexeme = L.u(verb, otype=\"lex\")[0]\n",
"        lexemes[lexeme] += 1\n",
"print(f\"{len(lexemes)} lexemes found\")"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:44:29.427472Z",
"start_time": "2018-04-14T08:44:29.420788Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1045 x 1437422 \"אמר\" = say\n",
" 203 x 1437412 \"היה\" = be\n",
" 107 x 1437561 \"הלך\" = walk\n",
" 106 x 1437570 \"מות\" = die\n",
" 87 x 1437424 \"ראה\" = see\n",
" 80 x 1437574 \"בוא\" = come\n",
" 71 x 1437645 \"שׁוב\" = return\n",
" 70 x 1437569 \"אכל\" = eat\n",
" 52 x 1437685 \"קום\" = arise\n",
" 45 x 1437654 \"חיה\" = be alive\n"
]
}
],
"source": [
"for (lex, amount) in sorted(\n",
"    lexemes.items(),\n",
"    key=lambda x: (-x[1], x[0]),\n",
")[0:10]:\n",
"    print(f'{amount:>4} x {lex} \"{F.voc_lex_utf8.v(lex)}\" = {F.gloss.v(lex)}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"\n",
"# Open Scriptures Morphology"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* align the WLC with the BHS\n",
"* compare the OSM with the BHSA"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Aligning\n",
"\n",
"See [BHSAbridgeOSM.ipynb](https://github.com/ETCBC/bridging/blob/master/programs/BHSAbridgeOSM.ipynb)\n",
"\n",
"* performs a consonant-by-consonant alignment between the WLC and the BHS\n",
"* stumbles on a few cases that require a hint:"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"```python\n",
"exceptions = {\n",
"    215253: 1,\n",
"    266189: 1,\n",
"    287360: 2,\n",
"    376865: 1,\n",
"    383405: 2,\n",
"    384049: 1,\n",
"    384050: 1,\n",
"    405102: -2,\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"```\n",
"Succeeded in aligning BHS with OSM\n",
"420103 BHS words matched against 469448 OSM morphemes with 8 known exceptions\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Spotting the anomalies\n",
"\n",
"With a bit of weaving, these exceptions are:"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"```\n",
"Isaiah 9:6\n",
" BHS 215253 = מרבה\n",
" OSM w1 = םרבה\n",
"Ezekiel 4:6\n",
" BHS 266189 = ימוני\n",
" OSM w7 = ימיני\n",
"```\n",
"\n",
"```\n",
"Ezekiel 43:11\n",
" BHS 287360 = צורתו\n",
" OSM w17, w17 = צורת/י\n",
"Daniel 10:19\n",
" BHS 376865 = כְ\n",
" OSM w10 = בְ\n",
"Ezra 10:44\n",
" BHS 383405 = נשׂאו\n",
" OSM w3, w3 = נשא/י\n",
"\n",
"```\n",
"\n",
"```\n",
"Nehemiah 2:13\n",
" BHS 384049 = הם\n",
" OSM w17 = ה\n",
"Nehemiah 2:13\n",
" BHS 384050 = פרוצים\n",
" OSM w17 = מפרוצים\n",
"```\n",
"\n",
"```\n",
"1_Chronicles 27:12\n",
" BHS 405102, 405103 = בן/ימיני\n",
" OSM w6 = בנימיני\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Word breaking\n",
"\n",
"There are cases where the OSM and the BHSA differ in the breaking-up of words."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"```\n",
"OSM morphemes without BHSA word: 0\n",
"OSM morphemes with multiple BHSA words: 130\n",
"OSM morphemes with 2 BHSA words: 123\n",
"OSM morphemes with 3 BHSA words: 7\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Unfinished\n",
"\n",
"The OSM is not yet finished.\n",
"\n",
"We made a list of the word nodes for which no morpheme has been tagged."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"53841 words ≈ 13% of the 420103 aligned BHS words are unfinished."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"```\n",
"Non-marked-up stretches having length x: y times\n",
" 1: 14990\n",
" 2: 8336\n",
" 3: 2802\n",
" 4: 1090\n",
" 5: 493\n",
" 6: 285\n",
" 7: 162\n",
" 8: 90\n",
" 9: 70\n",
" 10: 37\n",
" 11: 33\n",
" 12: 19\n",
" 13: 11\n",
" 14: 17\n",
" 15: 9\n",
" 16: 7\n",
" 17: 2\n",
" 18: 2\n",
" 19: 6\n",
" 20: 1\n",
" 21: 1\n",
" 22: 3\n",
" 23: 2\n",
" 25: 2\n",
" 26: 2\n",
" 27: 1\n",
" 28: 1\n",
" 29: 1\n",
" 32: 1\n",
" 33: 1\n",
" 35: 1\n",
" 36: 1\n",
" 38: 2\n",
" 41: 1\n",
" 47: 1\n",
" 60: 1\n",
" 61: 1\n",
" 72: 1\n",
" 74: 1\n",
" 75: 1\n",
"```"
]
},
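{
"cell_type": "markdown",
"metadata": {},
"source": [
"A histogram like this can be computed by walking through the untagged word nodes in order and measuring each run of consecutive node numbers. A sketch with toy data; `stretch_lengths` and the node numbers are made up for illustration and are not taken from the bridging notebook.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"\n",
"def stretch_lengths(nodes):\n",
"    # nodes: sorted node numbers of words without a morphology tag\n",
"    lengths = Counter()\n",
"    run = 0\n",
"    previous = None\n",
"    for n in nodes:\n",
"        if previous is not None and n == previous + 1:\n",
"            run += 1  # still inside the same stretch\n",
"        else:\n",
"            if run:\n",
"                lengths[run] += 1  # close the previous stretch\n",
"            run = 1\n",
"        previous = n\n",
"    if run:\n",
"        lengths[run] += 1  # close the last stretch\n",
"    return lengths\n",
"\n",
"\n",
"untagged = [5, 6, 7, 10, 12, 13]  # toy data\n",
"hist = stretch_lengths(untagged)\n",
"for length in sorted(hist):\n",
"    print(f'{length:>3}: {hist[length]}')\n",
"```\n",
"\n",
"Summing length times count over the histogram gives back the number of untagged words; for the table above that sum is 53841."
]
},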
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"What remains is: filling in the dots!"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We will carry out the comparison for unproblematic words:\n",
"\n",
"* good alignment ( 8 BHSA words excluded)\n",
"* same word breaks (276 BHSA words excluded)\n",
"* morph tags available"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Result: OSM module\n",
"\n",
"Two new TF features:\n",
"* `osm.tf` (main words)\n",
"* `osm_sf.tf` (suffixes)\n",
"\n",
"Together: the **OSM module**"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"```\n",
"@node\n",
"@conversion=notebook openscriptures in BHSA repo\n",
"@conversion_author=Dirk Roorda\n",
"@coreData=BHSA\n",
"@coreVersion=2017\n",
"@description=primary morphology string according to OpenScriptures\n",
"@source=Open Scriptures\n",
"@source_url=https://github.com/openscriptures/morphhb\n",
"@valueType=str\n",
"@writtenBy=Text-Fabric\n",
"@dateWritten=2018-01-12T13:21:01Z\n",
"\n",
"HR\n",
"HNcfsa\n",
"HVqp3ms\n",
"HNcmpa\n",
"HTo\n",
"HTd\n",
"HNcmpa\n",
"HC\n",
"HTo\n",
"HTd\n",
"HNcbsa\n",
"\n",
"... and 400,000 more lines\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Comparing\n",
"\n",
"We compare *categories*."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"In [OSM](http://openscriptures.github.io/morphhb/parsing/HebrewMorphologyCodes.html):\n",
"* parts of speech\n",
"* and their subtypes"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"In BHSA the features:\n",
"* [sp](https://etcbc.github.io/bhsa/features/hebrew/2017/sp.html) = part of speech\n",
"* [ls](https://etcbc.github.io/bhsa/features/hebrew/2017/ls.html) = lexical set\n",
"* [nametype](https://etcbc.github.io/bhsa/features/hebrew/2017/nametype.html)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"\n",
"\n",
"## OSM categories\n",
"\n",
"```python\n",
"pspOSM = {\n",
"    '': dict(\n",
"        A='adjective',\n",
"        C='conjunction',\n",
"        D='adverb',\n",
"        N='noun',\n",
"        P='pronoun',\n",
"        R='preposition',\n",
"        S='suffix',\n",
"        T='particle',\n",
"        V='verb',\n",
"    ),\n",
"    'A': dict(\n",
"        a='adjective',\n",
"        c='cardinal number',\n",
"        g='gentilic',\n",
"        o='ordinal number',\n",
"    ),\n",
"    'N': dict(\n",
"        c='common',\n",
"        g='gentilic',\n",
"        p='proper name',\n",
"    ),\n",
"    'P': dict(\n",
"        d='demonstrative',\n",
"        f='indefinite',\n",
"        i='interrogative',\n",
"        p='personal',\n",
"        r='relative',\n",
"    ),\n",
"    'R': dict(\n",
"        d='definite article',\n",
"    ),\n",
"    'S': dict(\n",
"        d='directional he',\n",
"        h='paragogic he',\n",
"        n='paragogic nun',\n",
"        p='pronominal',\n",
"    ),\n",
"    'T': dict(\n",
"        a='affirmation',\n",
"        d='definite article',\n",
"        e='exhortation',\n",
"        i='interrogative',\n",
"        j='interjection',\n",
"        m='demonstrative',\n",
"        n='negative',\n",
"        o='direct object marker',\n",
"        r='relative',\n",
"    ),\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"\n",
"\n",
"## BHSA categories\n",
"\n",
"```python\n",
"spBHS = dict(\n",
"    art='article',\n",
"    verb='verb',\n",
"    subs='noun',\n",
"    nmpr='proper noun',\n",
"    advb='adverb',\n",
"    prep='preposition',\n",
"    conj='conjunction',\n",
"    prps='personal pronoun',\n",
"    prde='demonstrative pronoun',\n",
"    prin='interrogative pronoun',\n",
"    intj='interjection',\n",
"    nega='negative particle',\n",
"    inrg='interrogative particle',\n",
"    adjv='adjective',\n",
")\n",
"lsBHS = dict(\n",
"    nmdi='distributive noun',\n",
"    nmcp='copulative noun',\n",
"    padv='potential adverb',\n",
"    afad='anaphoric adverb',\n",
"    ppre='potential preposition',\n",
"    cjad='conjunctive adverb',\n",
"    ordn='ordinal',\n",
"    vbcp='copulative verb',\n",
"    mult='noun of multitude',\n",
"    focp='focus particle',\n",
"    ques='interrogative particle',\n",
"    gntl='gentilic',\n",
"    quot='quotation verb',\n",
"    card='cardinal',\n",
"    none=MISSING,\n",
")\n",
"nametypeBHS = dict(\n",
"    pers='person',\n",
"    mens='measurement unit',\n",
"    gens='people',\n",
"    topo='place',\n",
"    ppde='demonstrative personal pronoun',\n",
")\n",
"nametypeBHS.update({\n",
"    'pers,gens,topo': 'person',\n",
"    'pers,gens': 'person',\n",
"    'gens,topo': 'gentilic',\n",
"    'pers,god': 'person',\n",
"})\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
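},
"source": [
"## Better dumb than smart\n",
"\n",
"We just counted the (OSM, BHSA) category pairs that co-occur on the same word.\n",
"\n",
"Here is a selection of the outcomes.\n",
"\n",
"This is OSM versus BHSA: each block is headed by an OSM category and lists the BHSA categories (`sp:ls:nametype`) found on the same words."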
]
},
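{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dumb counting can be sketched with a `Counter` over pairs. The pairs below are toy data and `tabulate` is a made-up helper; the real comparison lives in the bridging notebook and may compute and format its percentages differently.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"\n",
"def tabulate(pairs):\n",
"    # pairs: one (OSM category, BHSA category) tuple per word\n",
"    counts = Counter(pairs)\n",
"    totals = Counter(osm for (osm, bhsa) in pairs)\n",
"    lines = []\n",
"    for osm in sorted(totals):\n",
"        lines.append(osm)\n",
"        rows = [(n, bhsa) for ((o, bhsa), n) in counts.items() if o == osm]\n",
"        # most frequent BHSA category first\n",
"        for n, bhsa in sorted(rows, key=lambda r: (-r[0], r[1])):\n",
"            perc = int(100 * n / totals[osm])\n",
"            lines.append(f'    {bhsa} ({perc:>3}% = {n}x)')\n",
"    return lines\n",
"\n",
"\n",
"# toy data, not real counts\n",
"pairs = [\n",
"    ('verb', 'verb'),\n",
"    ('verb', 'verb'),\n",
"    ('verb', 'verb'),\n",
"    ('verb', 'noun'),\n",
"    ('preposition', 'preposition'),\n",
"]\n",
"lines = tabulate(pairs)\n",
"print('\\n'.join(lines))\n",
"```\n",
"\n",
"No linguistic knowledge goes into this: frequent pairs show which categories correspond, rare pairs are the ones worth a closer look."
]
},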
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Verbs\n",
"\n",
"```\n",
"verb\n",
"\tverb:: ( 84% = 50691x)\n",
"\tverb:quotation verb: ( 10% = 6137x)\n",
"\tverb:copulative verb: ( 5% = 3246x)\n",
"\tnoun:: ( 0% = 6x)\n",
"\tadjective:: ( 0% = 3x)\n",
"\tpreposition:: ( 0% = 1x)\n",
"\tproper noun:: ( 0% = 1x)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"**Excellent!** Just 11 discrepancies in some 60,000 cases: 99.98% agreement!"
]
},
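{
"cell_type": "markdown",
"metadata": {},
"source": [
"The percentage can be checked from the counts in the table. The sketch assumes that every BHSA row whose part of speech is `verb` counts as agreement; the variable names are only for illustration.\n",
"\n",
"```python\n",
"# counts copied from the verb table above\n",
"verb_rows = {\n",
"    'verb::': 50691,\n",
"    'verb:quotation verb:': 6137,\n",
"    'verb:copulative verb:': 3246,\n",
"    'noun::': 6,\n",
"    'adjective::': 3,\n",
"    'preposition::': 1,\n",
"    'proper noun::': 1,\n",
"}\n",
"\n",
"# rows where BHSA also says verb (assumed to count as agreement)\n",
"agree = sum(n for (label, n) in verb_rows.items() if label.startswith('verb:'))\n",
"total = sum(verb_rows.values())\n",
"disc = total - agree\n",
"agreePerc = round(100 * agree / total, 2)\n",
"print(f'{disc} discrepancies out of {total}: {agreePerc}% agreement')\n",
"```"
]
},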
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Prepositions\n",
"\n",
"```\n",
"preposition\n",
"\tpreposition:: ( 96% = 50697x)\n",
"\tnoun:potential preposition: ( 3% = 1643x)\n",
"\tadverb:conjunctive adverb: ( 0% = 194x)\n",
"\tinterrogative particle:: ( 0% = 169x)\n",
"\tnoun:cardinal: ( 0% = 13x)\n",
"\tconjunction:: ( 0% = 5x)\n",
"\tnoun:: ( 0% = 2x)\n",
"\tproper noun:: ( 0% = 2x)\n",
"\tarticle:: ( 0% = 1x)\n",
"\tverb:: ( 0% = 1x)\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-14T08:44:56.284384Z",
"start_time": "2018-04-14T08:44:56.279550Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Discrepancies: 0.76% = 387x out of 51084\n"
]
}
],
"source": [
"disc = 194 + 169 + 13 + 5 + 2 + 2 + 1 + 1\n",
"tot = 50697 + disc\n",
"discPerc = round(100 * disc / tot, 2)\n",
"print(f\"Discrepancies: {discPerc}% = {disc}x out of {tot}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Attention needed!\n",
"\n",
"* all *rare* cases have been collected into a big list\n",
"* context info has been woven into the list\n",
"* there are 645 such cases\n",
"* see [allCategoriesCases.tsv](https://github.com/ETCBC/bridging/blob/master/programs/allCategoriesCases.tsv) on GitHub"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"![1](images/cases1.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"![2](images/cases2.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Follow up?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* inspect the rare cases:\n",
" * these might be glitches, in BHSA or in OSM or in both\n",
" * these might be disputable cases: add them to the docs"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* inspect the majority cases: which categories map to which?\n",
" * maybe some categories can be harmonized\n",
" * if that is not desirable: we can generate an exhaustive mapping"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* in the end: we can make a BHSA-OSM category mapping that is\n",
" * comprehensive\n",
" * machine-readable\n",
" * documented"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Conclusions"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"## BHSA versus OSM\n",
"BHSA is awesome\n",
"\n",
"OSM is terrific"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"\n",
"BHSA $+$ OSM $\\gt($ awesome $+$ terrific $)$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Data, Logic, and Logistics: Text-Fabric\n",
"\n",
"\n",
"[Text-Fabric](https://github.com/Dans-labs/text-fabric) is a\n",
"* [model](https://github.com/Dans-labs/text-fabric/wiki/Data-Model)\n",
"* [file format](https://annotation.github.io/text-fabric/tf/about/fileformats.html)\n",
"* [tool](https://annotation.github.io/text-fabric/tf/cheatsheet.html)\n",
"\n",
"to support the logistics of the interchange of textual treasures\n",
"\n",
"So that you ...\n",
"\n",
"* ... researcher and tinkerer\n",
"* ... programming theologian\n",
"\n",
"* can grab parts from GitHub\n",
"* bring them to your shed\n",
"* and join them together on your workbench"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"\n",
"\n",
"**Designed especially for you** - Thank you\n",
"\n",
"dirk.roorda@dans.knaw.nl"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"jupytext": {
"encoding": "# -*- coding: utf-8 -*-"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
},
"livereveal": {
"scroll": true,
"start_slideshow_at": "selected"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"toc_cell": false,
"toc_position": {},
"toc_section_display": "block",
"toc_window_display": false
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}