{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"http://laf-fabric.readthedocs.org/en/latest/\" target=\"_blank\"><img align=\"left\" src=\"images/laf-fabric-xsmall.png\"/></a>\n",
    "<a href=\"http://www.godgeleerdheid.vu.nl/etcbc\" target=\"_blank\"><img align=\"left\" src=\"images/VU-ETCBC-xsmall.png\"/></a>\n",
    "<a href=\"http://tla.mpi.nl\" target=\"_blank\"><img align=\"right\" src=\"images/TLA-xsmall.png\"/></a>\n",
    "<a href=\"http://www.dans.knaw.nl\" target=\"_blank\"><img align=\"right\"src=\"images/DANS-xsmall.png\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img align=\"left\" src=\"images/1kings-syriac-shot.png\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CALAP"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**About CALAP**. CALAP stands for Computer-Assisted Linguistic Analysis of the Peshitta.\n",
    "CALAP was a [project](https://openaccess.leidenuniv.nl/handle/1887/10866) at the University of Leiden.\n",
    "The [Peshitta](http://en.wikipedia.org/wiki/Peshitta) is a collection of Syriac texts. According to Wikipedia it is the standard version of the Bible in churches of the Syriac tradition. Resources can be found on [peshitta.org](http://www.peshitta.org). \n",
    "\n",
    "The text we use below comes from the [Peshitta Institute Leiden](http://www.hum.leiden.edu/religion/research/peshitta-institute/peshitta-institute.html), and has been prepared as an EMDROS database, which is now held by the [ETCBC](http://www.godgeleerdheid.vu.nl/etcbc). \n",
    "\n",
    "From there is has been converted to [LAF](http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326)\n",
    "by Dirk Roorda, and this notebook accesses this LAF data by means of\n",
    "[LAF-Fabric](http://laf-fabric.readthedocs.org/en/latest/).\n",
    "\n",
    "The LAF-data of the CALAP project has been archived at DANS:\n",
    "[DOI 10.17026/dans-zv9-w9d2](http://dx.doi.org/10.17026/dans-zv9-w9d2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text from features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here comes the plain text of the CALAP data.\n",
    "\n",
    "The CALAP database only contains the surface consonants as textual representation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0.00s This is LAF-Fabric 4.8.3\n",
      "API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n",
      "Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import sys\n",
    "import collections\n",
    "\n",
    "from etcbc.lib import Transcription\n",
    "from laf.fabric import LafFabric\n",
    "fabric = LafFabric()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0.00s LOADING API: please wait ... \n",
      "  0.00s USING main: calap DATA COMPILED AT: 2016-10-31T11-56-40\n",
      "  0.28s LOGFILE=/Users/dirk/laf/laf-fabric-output/calap/plain/__log__plain.txt\n",
      "  0.28s INFO: DATA LOADED FROM SOURCE calap AND ANNOX  FOR TASK plain AT 2016-10-31T11-56-50\n"
     ]
    }
   ],
   "source": [
    "fabric.load('calap', '--', 'plain', {\n",
    "    \"xmlids\": {\"node\": False, \"edge\": False},\n",
    "    \"features\": ('''\n",
    "        otype\n",
    "        surface_consonants\n",
    "        psp\n",
    "        book chapter verse verse_label\n",
    "    ''',''),\n",
    "    \"primary\": True,\n",
    "})\n",
    "exec(fabric.localnames.format(var='fabric'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<,>,B,C,D,G,H,J,K,L,M,N,P,Q,R,S,T,V,W,X,Y,Z\n"
     ]
    }
   ],
   "source": [
    "plain_file = outfile(\"calap_plain.txt\")\n",
    "\n",
    "tr = Transcription()\n",
    "catalog = set()\n",
    "for i in F.otype.s('word'):\n",
    "    sf = F.surface_consonants.v(i)\n",
    "    for x in sf: catalog.add(x)\n",
    "    the_text = tr.to_syriac(sf)\n",
    "    plain_file.write(the_text + ' ')\n",
    "\n",
    "plain_file.close()\n",
    "print(','.join(sorted(catalog)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This file does not have newlines, it is a blob of consonant transcriptions for each word separated by spaces."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Passage indicators"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want books, chapters and verses marked, you can achieve it in the following way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "141613 Sirach                        \n"
     ]
    }
   ],
   "source": [
    "plainx_file = outfile(\"calap_plainx.txt\")\n",
    "\n",
    "the_book = None\n",
    "the_chapter = None\n",
    "the_verse = None\n",
    "\n",
    "for i in NN():\n",
    "    this_type = F.otype.v(i)\n",
    "    if this_type == \"word\":\n",
    "        the_text = tr.to_syriac(F.surface_consonants.v(i))\n",
    "        the_suffix = ' '\n",
    "        plainx_file.write(the_text + the_suffix)\n",
    "    elif this_type == \"book\":\n",
    "        the_book = F.book.v(i)\n",
    "        sys.stderr.write(\"\\r{:>6} {:<30}\".format(i, the_book)) \n",
    "        plainx_file.write(\"\\n{}\".format(the_book))\n",
    "    elif this_type == \"chapter\":\n",
    "        the_chapter = F.chapter.v(i)\n",
    "        plainx_file.write(\"\\n{} {}\".format(the_book, the_chapter))\n",
    "    elif this_type == \"verse\":\n",
    "        the_verse = F.verse.v(i)\n",
    "        plainx_file.write(\"\\n{}:{} \".format(the_chapter, the_verse))\n",
    "sys.stderr.write(\"\\n\")\n",
    "\n",
    "plainx_file.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to show the syriac text, you need to install a font that has glyphs for the syriac unicode characters (0700 - 074F).\n",
    "For example: Estrangelo Edessa from [Meltho](http://www.bethmardutho.org/index.php/resources/fonts.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r\n",
      "I_Kings\r\n",
      "I_Kings 1\r\n",
      "1:1 ܘ ܡܠܟܐ ܕܘܝܕ ܣܐܒ ܘ ܥܠ ܒ ܫܢܝܐ ܘ ܡܟܣܝܢ ܗܘܘ ܠܗ ܒ ܠܒܘܫܐ ܘ ܠܐ ܫܚܢ \r\n",
      "1:2 ܘ ܐܡܪܘ ܠܗ ܥܒܕܘܗܝ ܗܐ ܥܒܕܝܟ ܩܕܡܝܟ ܢܒܥܘܢ ܠ ܡܪܢ ܡܠܟܐ ܥܠܝܡܬܐ ܒܬܘܠܬܐ ܘ ܬܩܘܡ ܩܕܡ ܡܠܟܐ ܘ ܬܗܘܐ ܠܗ ܡܫܡܫܢܝܬܐ ܘ ܬܫܟܒ ܒ ܥܘܒܟ ܘ ܢܫܚܢ ܠ ܡܪܢ ܡܠܟܐ \r\n",
      "1:3 ܘ ܒܥܘ ܥܠܝܡܬܐ ܕ ܫܦܝܪܐ ܒ ܟܠܗ ܬܚܘܡܐ ܕ ܐܝܣܪܝܠ ܘ ܐܫܟܚܘ ܠ ܐܒܝܫܓ ܫܝܠܘܡܝܬܐ ܘ ܐܝܬܝܘܗ ܠ ܡܠܟܐ \r\n",
      "1:4 ܘ ܥܠܝܡܬܐ ܫܦܝܪܐ ܗܘܬ ܒ ܚܙܘܗ ܛܒ ܘ ܗܘܬ ܠ ܡܠܟܐ ܡܫܡܫܢܝܬܐ ܘ ܡܫܡܫܐ ܠܗ ܘ ܡܠܟܐ ܠܐ ܝܕܥܗ \r\n",
      "1:5 ܘ ܐܕܘܢܝܐ ܒܪ ܚܓܝܬ ܡܬܪܘܪܒ ܘ ܐܡܪ ܐܢܐ ܐܡܠܟ ܘ ܥܒܕ ܠܗ ܡܪܟܒܬܐ ܘ ܦܪܫܐ ܘ ܚܡܫܝܢ ܓܒܪܝܢ ܕ ܪܗܛܝܢ ܗܘܘ ܩܕܡܘܗܝ \r\n",
      "1:6 ܘ ܠܐ ܟܐܐ ܒܗ ܐܒܘܗܝ ܡܢ ܝܘܡܘܗܝ ܘ ܐܡܪ ܠܗ ܡܛܠ ܡܢܐ ܗܟܢܐ ܥܒܕ ܐܢܬ ܘ ܐܦ ܗܘ ܫܦܝܪ ܗܘܐ ܒ ܚܙܘܗ ܛܒ ܘ ܠܗ ܝܠܕܬ ܒܬܪ ܐܒܫܠܘܡ \r",
      "\r\n",
      "1:7 ܘ ܗܘܘ ܦܬܓܡܘܗܝ ܥܡ ܝܘܐܒ ܒܪ ܨܘܪܝܐ ܘ ܥܡ ܐܒܝܬܪ ܟܗܢܐ ܘ ܡܥܕܪܝܢ ܒܬܪ ܐܕܘܢܝܐ \r\n"
     ]
    }
   ],
   "source": [
    "!head -n 10 {my_file('calap_plainx.txt')}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are in an environment where you do not have this font installed, see the screenshot at the top screenshot.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Verse list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get the text in a quite different way: just read it from the *primary data*.\n",
    "\n",
    "Let us do that per verse."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "verse_file = outfile(\"calap_verses.txt\")\n",
    "\n",
    "for i in F.otype.s('verse'):\n",
    "    the_text = tr.to_syriac(''.join([txt for (j, txt) in P.data(i)]))\n",
    "    the_verse = F.verse_label.v(i)\n",
    "    verse_file.write(\"{}\\n{}\\n\".format(the_verse, the_text))\n",
    "\n",
    "verse_file.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1R 1,1\r\n",
      "ܘ ܡܠܟܐ ܕܘܝܕ ܣܐܒ ܘ ܥܠ ܒ ܫܢܝܐ ܘ ܡܟܣܝܢ ܗܘܘ ܠܗ ܒ ܠܒܘܫܐ ܘ ܠܐ ܫܚܢ \r\n",
      "1R 1,2\r\n",
      "ܘ ܐܡܪܘ ܠܗ ܥܒܕܘܗܝ ܗܐ ܥܒܕܝܟ ܩܕܡܝܟ ܢܒܥܘܢ ܠ ܡܪܢ ܡܠܟܐ ܥܠܝܡܬܐ ܒܬܘܠܬܐ ܘ ܬܩܘܡ ܩܕܡ ܡܠܟܐ ܘ ܬܗܘܐ ܠܗ ܡܫܡܫܢܝܬܐ ܘ ܬܫܟܒ ܒ ܥܘܒܟ ܘ ܢܫܚܢ ܠ ܡܪܢ ܡܠܟܐ \r\n",
      "1R 1,3\r\n",
      "ܘ ܒܥܘ ܥܠܝܡܬܐ ܕ ܫܦܝܪܐ ܒ ܟܠܗ ܬܚܘܡܐ ܕ ܐܝܣܪܝܠ ܘ ܐܫܟܚܘ ܠ ܐܒܝܫܓ ܫܝܠܘܡܝܬܐ ܘ ܐܝܬܝܘܗ ܠ ܡܠܟܐ \r\n",
      "1R 1,4\r\n",
      "ܘ ܥܠܝܡܬܐ ܫܦܝܪܐ ܗܘܬ ܒ ܚܙܘܗ ܛܒ ܘ ܗܘܬ ܠ ܡܠܟܐ ܡܫܡܫܢܝܬܐ ܘ ܡܫܡܫܐ ܠܗ ܘ ܡܠܟܐ ܠܐ ܝܕܥܗ \r\n",
      "1R 1,5\r\n",
      "ܘ ܐܕܘܢܝܐ ܒܪ ܚܓܝܬ ܡܬܪܘܪܒ ܘ ܐܡܪ ܐܢܐ ܐܡܠܟ ܘ ܥܒܕ ܠܗ ܡܪܟܒܬܐ ܘ ܦܪܫܐ ܘ ܚܡܫܝܢ ܓܒܪܝܢ ܕ ܪܗܛܝܢ ܗܘܘ ܩܕܡܘܗܝ \r\n"
     ]
    }
   ],
   "source": [
    "!head -n 10 {my_file('calap_verses.txt')}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Empty words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the BHS there are words that have an empty representation.\n",
    "\n",
    "Let us have a closer look to the CALAP.\n",
    "Are there empty words?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "No empty words found\n"
     ]
    }
   ],
   "source": [
    "ewords = collections.defaultdict(lambda: [])\n",
    "verse = None\n",
    "\n",
    "for i in NN(test=F.otype.v, values=['verse', 'word']):\n",
    "    if F.otype.v(i) == 'verse':\n",
    "        verse = i\n",
    "        continue\n",
    "    text = F.surface_consonants.v(i)\n",
    "    if text == '':\n",
    "        lex = lexeme.v(i)\n",
    "        pos = F.psp.v(i)\n",
    "        ewords[(lex, pos)].append(verse)\n",
    "\n",
    "for (item, occs) in sorted(ewords.items(), key=lambda x: (-len(x[1]), x[0][1], x[0][0])):\n",
    "    print(\"{:>6} x {:<15} = {:>10} in {}{}\".format(\n",
    "        len(occs), \n",
    "        item[1], \n",
    "        item[0], \n",
    "        \"; \".join([F.verse_label.v(j) for j in occs][0:5]),\n",
    "        ' ...' if len(occs) > 20 else '',\n",
    "    ))\n",
    "if not len(ewords):\n",
    "    print(\"No empty words found\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "    23s Results directory:\n",
      "/Users/dirk/laf/laf-fabric-output/calap/plain\n",
      "\n",
      ".DS_Store                              6148 Thu Apr 17 18:26:31 2014\n",
      "__log__plain.txt                        202 Mon Oct 31 12:57:13 2016\n",
      "calap_plain.txt                      380716 Mon Oct 31 12:56:54 2016\n",
      "calap_plainx.txt                     399490 Mon Oct 31 12:56:57 2016\n",
      "calap_verses.txt                     408012 Mon Oct 31 12:57:05 2016\n"
     ]
    }
   ],
   "source": [
    "close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}