{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# CALAP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**About CALAP**. CALAP stands for Computer-Assisted Linguistic Analysis of the Peshitta.\n", "CALAP was a [project](https://openaccess.leidenuniv.nl/handle/1887/10866) at the University of Leiden.\n", "The [Peshitta](http://en.wikipedia.org/wiki/Peshitta) is a collection of Syriac texts. According to Wikipedia it is the standard version of the Bible in churches of the Syriac tradition. Resources can be found on [peshitta.org](http://www.peshitta.org). \n", "\n", "The text we use below comes from the [Peshitta Institute Leiden](http://www.hum.leiden.edu/religion/research/peshitta-institute/peshitta-institute.html), and has been prepared as an EMDROS database, which is now held by the [ETCBC](http://www.godgeleerdheid.vu.nl/etcbc). \n", "\n", "From there it has been converted to [LAF](http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326)\n", "by Dirk Roorda, and this notebook accesses this LAF data by means of\n", "[LAF-Fabric](http://laf-fabric.readthedocs.org/en/latest/).\n", "\n", "The LAF-data of the CALAP project has been archived at DANS:\n", "[DOI 10.17026/dans-zv9-w9d2](http://dx.doi.org/10.17026/dans-zv9-w9d2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text from features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here comes the plain text of the CALAP data.\n", "\n", "The CALAP database only contains the surface consonants as textual representation." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s This is LAF-Fabric 4.8.3\n", "API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n", "Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html\n", "\n" ] } ], "source": [ "import sys\n", "import collections\n", "\n", "from etcbc.lib import Transcription\n", "from laf.fabric import LafFabric\n", "fabric = LafFabric()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s LOADING API: please wait ... \n", " 0.00s USING main: calap DATA COMPILED AT: 2016-10-31T11-56-40\n", " 0.28s LOGFILE=/Users/dirk/laf/laf-fabric-output/calap/plain/__log__plain.txt\n", " 0.28s INFO: DATA LOADED FROM SOURCE calap AND ANNOX FOR TASK plain AT 2016-10-31T11-56-50\n" ] } ], "source": [ "fabric.load('calap', '--', 'plain', {\n", "    \"xmlids\": {\"node\": False, \"edge\": False},\n", "    \"features\": ('''\n", "        otype\n", "        surface_consonants\n", "        psp\n", "        book chapter verse verse_label\n", "    ''',''),\n", "    \"primary\": True,\n", "})\n", "exec(fabric.localnames.format(var='fabric'))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<,>,B,C,D,G,H,J,K,L,M,N,P,Q,R,S,T,V,W,X,Y,Z\n" ] } ], "source": [ "plain_file = outfile(\"calap_plain.txt\")\n", "\n", "tr = Transcription()\n", "catalog = set()  # every distinct transcription character that occurs\n", "for i in F.otype.s('word'):\n", "    sf = F.surface_consonants.v(i)\n", "    catalog.update(sf)\n", "    the_text = tr.to_syriac(sf)\n", "    plain_file.write(the_text + ' ')\n", "\n", "plain_file.close()\n", "print(','.join(sorted(catalog)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This file has no newlines; it is a blob of consonant 
transcriptions for each word separated by spaces." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Passage indicators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want books, chapters and verses marked, you can achieve it in the following way:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "141613 Sirach \n" ] } ], "source": [ "plainx_file = outfile(\"calap_plainx.txt\")\n", "\n", "the_book = None\n", "the_chapter = None\n", "the_verse = None\n", "\n", "for i in NN():\n", "    this_type = F.otype.v(i)\n", "    if this_type == \"word\":\n", "        the_text = tr.to_syriac(F.surface_consonants.v(i))\n", "        the_suffix = ' '\n", "        plainx_file.write(the_text + the_suffix)\n", "    elif this_type == \"book\":\n", "        the_book = F.book.v(i)\n", "        sys.stderr.write(\"\\r{:>6} {:<30}\".format(i, the_book))\n", "        plainx_file.write(\"\\n{}\".format(the_book))\n", "    elif this_type == \"chapter\":\n", "        the_chapter = F.chapter.v(i)\n", "        plainx_file.write(\"\\n{} {}\".format(the_book, the_chapter))\n", "    elif this_type == \"verse\":\n", "        the_verse = F.verse.v(i)\n", "        plainx_file.write(\"\\n{}:{} \".format(the_chapter, the_verse))\n", "sys.stderr.write(\"\\n\")\n", "\n", "plainx_file.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to show the Syriac text, you need to install a font that has glyphs for the Syriac Unicode characters (0700 - 074F).\n", "For example: Estrangelo Edessa from [Meltho](http://www.bethmardutho.org/index.php/resources/fonts.html)." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "I_Kings\r\n", "I_Kings 1\r\n", "1:1 ܘ ܡܠܟܐ ܕܘܝܕ ܣܐܒ ܘ ܥܠ ܒ ܫܢܝܐ ܘ ܡܟܣܝܢ ܗܘܘ ܠܗ ܒ ܠܒܘܫܐ ܘ ܠܐ ܫܚܢ \r\n", "1:2 ܘ ܐܡܪܘ ܠܗ ܥܒܕܘܗܝ ܗܐ ܥܒܕܝܟ ܩܕܡܝܟ ܢܒܥܘܢ ܠ ܡܪܢ ܡܠܟܐ ܥܠܝܡܬܐ ܒܬܘܠܬܐ ܘ ܬܩܘܡ ܩܕܡ ܡܠܟܐ ܘ ܬܗܘܐ ܠܗ ܡܫܡܫܢܝܬܐ ܘ ܬܫܟܒ ܒ ܥܘܒܟ ܘ ܢܫܚܢ ܠ ܡܪܢ ܡܠܟܐ \r\n", "1:3 ܘ ܒܥܘ ܥܠܝܡܬܐ ܕ ܫܦܝܪܐ ܒ ܟܠܗ ܬܚܘܡܐ ܕ ܐܝܣܪܝܠ ܘ ܐܫܟܚܘ ܠ ܐܒܝܫܓ ܫܝܠܘܡܝܬܐ ܘ ܐܝܬܝܘܗ ܠ ܡܠܟܐ \r\n", "1:4 ܘ ܥܠܝܡܬܐ ܫܦܝܪܐ ܗܘܬ ܒ ܚܙܘܗ ܛܒ ܘ ܗܘܬ ܠ ܡܠܟܐ ܡܫܡܫܢܝܬܐ ܘ ܡܫܡܫܐ ܠܗ ܘ ܡܠܟܐ ܠܐ ܝܕܥܗ \r\n", "1:5 ܘ ܐܕܘܢܝܐ ܒܪ ܚܓܝܬ ܡܬܪܘܪܒ ܘ ܐܡܪ ܐܢܐ ܐܡܠܟ ܘ ܥܒܕ ܠܗ ܡܪܟܒܬܐ ܘ ܦܪܫܐ ܘ ܚܡܫܝܢ ܓܒܪܝܢ ܕ ܪܗܛܝܢ ܗܘܘ ܩܕܡܘܗܝ \r\n", "1:6 ܘ ܠܐ ܟܐܐ ܒܗ ܐܒܘܗܝ ܡܢ ܝܘܡܘܗܝ ܘ ܐܡܪ ܠܗ ܡܛܠ ܡܢܐ ܗܟܢܐ ܥܒܕ ܐܢܬ ܘ ܐܦ ܗܘ ܫܦܝܪ ܗܘܐ ܒ ܚܙܘܗ ܛܒ ܘ ܠܗ ܝܠܕܬ ܒܬܪ ܐܒܫܠܘܡ \r", "\r\n", "1:7 ܘ ܗܘܘ ܦܬܓܡܘܗܝ ܥܡ ܝܘܐܒ ܒܪ ܨܘܪܝܐ ܘ ܥܡ ܐܒܝܬܪ ܟܗܢܐ ܘ ܡܥܕܪܝܢ ܒܬܪ ܐܕܘܢܝܐ \r\n" ] } ], "source": [ "!head -n 10 {my_file('calap_plainx.txt')}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are in an environment where you do not have this font installed, see the screenshot at the top.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Verse list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get the text in a quite different way: just read it from the *primary data*.\n", "\n", "Let us do that per verse." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "verse_file = outfile(\"calap_verses.txt\")\n", "\n", "for i in F.otype.s('verse'):\n", "    the_text = tr.to_syriac(''.join([txt for (j, txt) in P.data(i)]))\n", "    the_verse = F.verse_label.v(i)\n", "    verse_file.write(\"{}\\n{}\\n\".format(the_verse, the_text))\n", "\n", "verse_file.close()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1R 1,1\r\n", "ܘ ܡܠܟܐ ܕܘܝܕ ܣܐܒ ܘ ܥܠ ܒ ܫܢܝܐ ܘ ܡܟܣܝܢ ܗܘܘ ܠܗ ܒ ܠܒܘܫܐ ܘ ܠܐ ܫܚܢ \r\n", "1R 1,2\r\n", "ܘ ܐܡܪܘ ܠܗ ܥܒܕܘܗܝ ܗܐ ܥܒܕܝܟ ܩܕܡܝܟ ܢܒܥܘܢ ܠ ܡܪܢ ܡܠܟܐ ܥܠܝܡܬܐ ܒܬܘܠܬܐ ܘ ܬܩܘܡ ܩܕܡ ܡܠܟܐ ܘ ܬܗܘܐ ܠܗ ܡܫܡܫܢܝܬܐ ܘ ܬܫܟܒ ܒ ܥܘܒܟ ܘ ܢܫܚܢ ܠ ܡܪܢ ܡܠܟܐ \r\n", "1R 1,3\r\n", "ܘ ܒܥܘ ܥܠܝܡܬܐ ܕ ܫܦܝܪܐ ܒ ܟܠܗ ܬܚܘܡܐ ܕ ܐܝܣܪܝܠ ܘ ܐܫܟܚܘ ܠ ܐܒܝܫܓ ܫܝܠܘܡܝܬܐ ܘ ܐܝܬܝܘܗ ܠ ܡܠܟܐ \r\n", "1R 1,4\r\n", "ܘ ܥܠܝܡܬܐ ܫܦܝܪܐ ܗܘܬ ܒ ܚܙܘܗ ܛܒ ܘ ܗܘܬ ܠ ܡܠܟܐ ܡܫܡܫܢܝܬܐ ܘ ܡܫܡܫܐ ܠܗ ܘ ܡܠܟܐ ܠܐ ܝܕܥܗ \r\n", "1R 1,5\r\n", "ܘ ܐܕܘܢܝܐ ܒܪ ܚܓܝܬ ܡܬܪܘܪܒ ܘ ܐܡܪ ܐܢܐ ܐܡܠܟ ܘ ܥܒܕ ܠܗ ܡܪܟܒܬܐ ܘ ܦܪܫܐ ܘ ܚܡܫܝܢ ܓܒܪܝܢ ܕ ܪܗܛܝܢ ܗܘܘ ܩܕܡܘܗܝ \r\n" ] } ], "source": [ "!head -n 10 {my_file('calap_verses.txt')}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Empty words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the BHS there are words that have an empty representation.\n", "\n", "Let us have a closer look at the CALAP data.\n", "Are there empty words?" 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No empty words found\n" ] } ], "source": [ "ewords = collections.defaultdict(list)\n", "verse = None\n", "\n", "for i in NN(test=F.otype.v, values=['verse', 'word']):\n", "    if F.otype.v(i) == 'verse':\n", "        verse = i\n", "        continue\n", "    text = F.surface_consonants.v(i)\n", "    if text == '':\n", "        # assumption: a 'lexeme' feature exists and has been added to the feature list in fabric.load()\n", "        lex = F.lexeme.v(i)\n", "        pos = F.psp.v(i)\n", "        ewords[(lex, pos)].append(verse)\n", "\n", "for (item, occs) in sorted(ewords.items(), key=lambda x: (-len(x[1]), x[0][1], x[0][0])):\n", "    print(\"{:>6} x {:<15} = {:>10} in {}{}\".format(\n", "        len(occs),\n", "        item[1],\n", "        item[0],\n", "        \"; \".join([F.verse_label.v(j) for j in occs][0:5]),\n", "        ' ...' if len(occs) > 5 else '',\n", "    ))\n", "if not ewords:\n", "    print(\"No empty words found\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 23s Results directory:\n", "/Users/dirk/laf/laf-fabric-output/calap/plain\n", "\n", ".DS_Store 6148 Thu Apr 17 18:26:31 2014\n", "__log__plain.txt 202 Mon Oct 31 12:57:13 2016\n", "calap_plain.txt 380716 Mon Oct 31 12:56:54 2016\n", "calap_plainx.txt 399490 Mon Oct 31 12:56:57 2016\n", "calap_verses.txt 408012 Mon Oct 31 12:57:05 2016\n" ] } ], "source": [ "close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }