{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CALAP"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**About CALAP**. CALAP stands for Computer-Assisted Linguistic Analysis of the Peshitta.\n",
"CALAP was a [project](https://openaccess.leidenuniv.nl/handle/1887/10866) at the University of Leiden.\n",
"The [Peshitta](http://en.wikipedia.org/wiki/Peshitta) is a collection of Syriac texts. According to Wikipedia it is the standard version of the Bible in churches of the Syriac tradition. Resources can be found on [peshitta.org](http://www.peshitta.org). \n",
"\n",
"The text we use below comes from the [Peshitta Institute Leiden](http://www.hum.leiden.edu/religion/research/peshitta-institute/peshitta-institute.html), and has been prepared as an EMDROS database, which is now held by the [ETCBC](http://www.godgeleerdheid.vu.nl/etcbc). \n",
"\n",
"From there is has been converted to [LAF](http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326)\n",
"by Dirk Roorda, and this notebook accesses this LAF data by means of\n",
"[LAF-Fabric](http://laf-fabric.readthedocs.org/en/latest/).\n",
"\n",
"The LAF-data of the CALAP project has been archived at DANS:\n",
"[DOI 10.17026/dans-zv9-w9d2](http://dx.doi.org/10.17026/dans-zv9-w9d2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text from features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here comes the plain text of the CALAP data.\n",
"\n",
"The CALAP database only contains the surface consonants as textual representation."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0.00s This is LAF-Fabric 4.8.3\n",
"API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n",
"Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html\n",
"\n"
]
}
],
"source": [
"import sys\n",
"import collections\n",
"\n",
"from etcbc.lib import Transcription\n",
"from laf.fabric import LafFabric\n",
"fabric = LafFabric()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0.00s LOADING API: please wait ... \n",
" 0.00s USING main: calap DATA COMPILED AT: 2016-10-31T11-56-40\n",
" 0.28s LOGFILE=/Users/dirk/laf/laf-fabric-output/calap/plain/__log__plain.txt\n",
" 0.28s INFO: DATA LOADED FROM SOURCE calap AND ANNOX FOR TASK plain AT 2016-10-31T11-56-50\n"
]
}
],
"source": [
"fabric.load('calap', '--', 'plain', {\n",
" \"xmlids\": {\"node\": False, \"edge\": False},\n",
" \"features\": ('''\n",
" otype\n",
" surface_consonants\n",
" psp\n",
" book chapter verse verse_label\n",
" ''',''),\n",
" \"primary\": True,\n",
"})\n",
"exec(fabric.localnames.format(var='fabric'))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<,>,B,C,D,G,H,J,K,L,M,N,P,Q,R,S,T,V,W,X,Y,Z\n"
]
}
],
"source": [
"plain_file = outfile(\"calap_plain.txt\")\n",
"\n",
"tr = Transcription()\n",
"catalog = set()\n",
"for i in F.otype.s('word'):\n",
" sf = F.surface_consonants.v(i)\n",
" for x in sf: catalog.add(x)\n",
" the_text = tr.to_syriac(sf)\n",
" plain_file.write(the_text + ' ')\n",
"\n",
"plain_file.close()\n",
"print(','.join(sorted(catalog)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This file does not have newlines, it is a blob of consonant transcriptions for each word separated by spaces."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Passage indicators"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want books, chapters and verses marked, you can achieve it in the following way:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"141613 Sirach \n"
]
}
],
"source": [
"plainx_file = outfile(\"calap_plainx.txt\")\n",
"\n",
"the_book = None\n",
"the_chapter = None\n",
"the_verse = None\n",
"\n",
"for i in NN():\n",
" this_type = F.otype.v(i)\n",
" if this_type == \"word\":\n",
" the_text = tr.to_syriac(F.surface_consonants.v(i))\n",
" the_suffix = ' '\n",
" plainx_file.write(the_text + the_suffix)\n",
" elif this_type == \"book\":\n",
" the_book = F.book.v(i)\n",
" sys.stderr.write(\"\\r{:>6} {:<30}\".format(i, the_book)) \n",
" plainx_file.write(\"\\n{}\".format(the_book))\n",
" elif this_type == \"chapter\":\n",
" the_chapter = F.chapter.v(i)\n",
" plainx_file.write(\"\\n{} {}\".format(the_book, the_chapter))\n",
" elif this_type == \"verse\":\n",
" the_verse = F.verse.v(i)\n",
" plainx_file.write(\"\\n{}:{} \".format(the_chapter, the_verse))\n",
"sys.stderr.write(\"\\n\")\n",
"\n",
"plainx_file.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to show the syriac text, you need to install a font that has glyphs for the syriac unicode characters (0700 - 074F).\n",
"For example: Estrangelo Edessa from [Meltho](http://www.bethmardutho.org/index.php/resources/fonts.html)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r\n",
"I_Kings\r\n",
"I_Kings 1\r\n",
"1:1 ܘ ܡܠܟܐ ܕܘܝܕ ܣܐܒ ܘ ܥܠ ܒ ܫܢܝܐ ܘ ܡܟܣܝܢ ܗܘܘ ܠܗ ܒ ܠܒܘܫܐ ܘ ܠܐ ܫܚܢ \r\n",
"1:2 ܘ ܐܡܪܘ ܠܗ ܥܒܕܘܗܝ ܗܐ ܥܒܕܝܟ ܩܕܡܝܟ ܢܒܥܘܢ ܠ ܡܪܢ ܡܠܟܐ ܥܠܝܡܬܐ ܒܬܘܠܬܐ ܘ ܬܩܘܡ ܩܕܡ ܡܠܟܐ ܘ ܬܗܘܐ ܠܗ ܡܫܡܫܢܝܬܐ ܘ ܬܫܟܒ ܒ ܥܘܒܟ ܘ ܢܫܚܢ ܠ ܡܪܢ ܡܠܟܐ \r\n",
"1:3 ܘ ܒܥܘ ܥܠܝܡܬܐ ܕ ܫܦܝܪܐ ܒ ܟܠܗ ܬܚܘܡܐ ܕ ܐܝܣܪܝܠ ܘ ܐܫܟܚܘ ܠ ܐܒܝܫܓ ܫܝܠܘܡܝܬܐ ܘ ܐܝܬܝܘܗ ܠ ܡܠܟܐ \r\n",
"1:4 ܘ ܥܠܝܡܬܐ ܫܦܝܪܐ ܗܘܬ ܒ ܚܙܘܗ ܛܒ ܘ ܗܘܬ ܠ ܡܠܟܐ ܡܫܡܫܢܝܬܐ ܘ ܡܫܡܫܐ ܠܗ ܘ ܡܠܟܐ ܠܐ ܝܕܥܗ \r\n",
"1:5 ܘ ܐܕܘܢܝܐ ܒܪ ܚܓܝܬ ܡܬܪܘܪܒ ܘ ܐܡܪ ܐܢܐ ܐܡܠܟ ܘ ܥܒܕ ܠܗ ܡܪܟܒܬܐ ܘ ܦܪܫܐ ܘ ܚܡܫܝܢ ܓܒܪܝܢ ܕ ܪܗܛܝܢ ܗܘܘ ܩܕܡܘܗܝ \r\n",
"1:6 ܘ ܠܐ ܟܐܐ ܒܗ ܐܒܘܗܝ ܡܢ ܝܘܡܘܗܝ ܘ ܐܡܪ ܠܗ ܡܛܠ ܡܢܐ ܗܟܢܐ ܥܒܕ ܐܢܬ ܘ ܐܦ ܗܘ ܫܦܝܪ ܗܘܐ ܒ ܚܙܘܗ ܛܒ ܘ ܠܗ ܝܠܕܬ ܒܬܪ ܐܒܫܠܘܡ \r",
"\r\n",
"1:7 ܘ ܗܘܘ ܦܬܓܡܘܗܝ ܥܡ ܝܘܐܒ ܒܪ ܨܘܪܝܐ ܘ ܥܡ ܐܒܝܬܪ ܟܗܢܐ ܘ ܡܥܕܪܝܢ ܒܬܪ ܐܕܘܢܝܐ \r\n"
]
}
],
"source": [
"!head -n 10 {my_file('calap_plainx.txt')}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are in an environment where you do not have this font installed, see the screenshot at the top screenshot.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Verse list"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can get the text in a quite different way: just read it from the *primary data*.\n",
"\n",
"Let us do that per verse."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"verse_file = outfile(\"calap_verses.txt\")\n",
"\n",
"for i in F.otype.s('verse'):\n",
" the_text = tr.to_syriac(''.join([txt for (j, txt) in P.data(i)]))\n",
" the_verse = F.verse_label.v(i)\n",
" verse_file.write(\"{}\\n{}\\n\".format(the_verse, the_text))\n",
"\n",
"verse_file.close()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1R 1,1\r\n",
"ܘ ܡܠܟܐ ܕܘܝܕ ܣܐܒ ܘ ܥܠ ܒ ܫܢܝܐ ܘ ܡܟܣܝܢ ܗܘܘ ܠܗ ܒ ܠܒܘܫܐ ܘ ܠܐ ܫܚܢ \r\n",
"1R 1,2\r\n",
"ܘ ܐܡܪܘ ܠܗ ܥܒܕܘܗܝ ܗܐ ܥܒܕܝܟ ܩܕܡܝܟ ܢܒܥܘܢ ܠ ܡܪܢ ܡܠܟܐ ܥܠܝܡܬܐ ܒܬܘܠܬܐ ܘ ܬܩܘܡ ܩܕܡ ܡܠܟܐ ܘ ܬܗܘܐ ܠܗ ܡܫܡܫܢܝܬܐ ܘ ܬܫܟܒ ܒ ܥܘܒܟ ܘ ܢܫܚܢ ܠ ܡܪܢ ܡܠܟܐ \r\n",
"1R 1,3\r\n",
"ܘ ܒܥܘ ܥܠܝܡܬܐ ܕ ܫܦܝܪܐ ܒ ܟܠܗ ܬܚܘܡܐ ܕ ܐܝܣܪܝܠ ܘ ܐܫܟܚܘ ܠ ܐܒܝܫܓ ܫܝܠܘܡܝܬܐ ܘ ܐܝܬܝܘܗ ܠ ܡܠܟܐ \r\n",
"1R 1,4\r\n",
"ܘ ܥܠܝܡܬܐ ܫܦܝܪܐ ܗܘܬ ܒ ܚܙܘܗ ܛܒ ܘ ܗܘܬ ܠ ܡܠܟܐ ܡܫܡܫܢܝܬܐ ܘ ܡܫܡܫܐ ܠܗ ܘ ܡܠܟܐ ܠܐ ܝܕܥܗ \r\n",
"1R 1,5\r\n",
"ܘ ܐܕܘܢܝܐ ܒܪ ܚܓܝܬ ܡܬܪܘܪܒ ܘ ܐܡܪ ܐܢܐ ܐܡܠܟ ܘ ܥܒܕ ܠܗ ܡܪܟܒܬܐ ܘ ܦܪܫܐ ܘ ܚܡܫܝܢ ܓܒܪܝܢ ܕ ܪܗܛܝܢ ܗܘܘ ܩܕܡܘܗܝ \r\n"
]
}
],
"source": [
"!head -n 10 {my_file('calap_verses.txt')}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Empty words"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the BHS there are words that have an empty representation.\n",
"\n",
"Let us have a closer look to the CALAP.\n",
"Are there empty words?"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No empty words found\n"
]
}
],
"source": [
"ewords = collections.defaultdict(lambda: [])\n",
"verse = None\n",
"\n",
"for i in NN(test=F.otype.v, values=['verse', 'word']):\n",
" if F.otype.v(i) == 'verse':\n",
" verse = i\n",
" continue\n",
" text = F.surface_consonants.v(i)\n",
" if text == '':\n",
" lex = lexeme.v(i)\n",
" pos = F.psp.v(i)\n",
" ewords[(lex, pos)].append(verse)\n",
"\n",
"for (item, occs) in sorted(ewords.items(), key=lambda x: (-len(x[1]), x[0][1], x[0][0])):\n",
" print(\"{:>6} x {:<15} = {:>10} in {}{}\".format(\n",
" len(occs), \n",
" item[1], \n",
" item[0], \n",
" \"; \".join([F.verse_label.v(j) for j in occs][0:5]),\n",
" ' ...' if len(occs) > 20 else '',\n",
" ))\n",
"if not len(ewords):\n",
" print(\"No empty words found\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 23s Results directory:\n",
"/Users/dirk/laf/laf-fabric-output/calap/plain\n",
"\n",
".DS_Store 6148 Thu Apr 17 18:26:31 2014\n",
"__log__plain.txt 202 Mon Oct 31 12:57:13 2016\n",
"calap_plain.txt 380716 Mon Oct 31 12:56:54 2016\n",
"calap_plainx.txt 399490 Mon Oct 31 12:56:57 2016\n",
"calap_verses.txt 408012 Mon Oct 31 12:57:05 2016\n"
]
}
],
"source": [
"close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}