{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lexemes\n",
"\n",
"Various ways to list the lexeme base of individual chapters in the Hebrew Bible.\n",
"\n",
"The *lexeme base* of a passage is the set of lexemes that occur in that passage.\n",
"\n",
"We define a function, ``lexbase(passages, excluded=xpassages)``,\n",
"that produces a file of the lexemes that occur in a given list of passages and do not occur in another given list of passages.\n",
"\n",
"If you have LAF-Fabric working and downloaded this notebook, you can call this function yourself in order to generate \n",
"lexeme bases of arbitrary passages.\n",
"\n",
"We also produce standard files with the lexeme bases of individual books, chapters and verses in the Bible.\n",
"\n",
"\n",
"# Output\n",
"\n",
"The output files are organized as follows:\n",
"\n",
"* all files are comma-separated text files that can be imported into a spreadsheet application such as OpenOffice or Excel;\n",
"* every line corresponds to a lexeme in the lexeme base and contains the following information:\n",
" * lexeme (a unique identifier in transcription, which may contain the characters `` / [ = ``),\n",
" * frequency (number of occurrences of this lexeme in the whole Hebrew Bible),\n",
" * ``lex_utf8`` feature (the lexeme in Hebrew as it occurs in the ETCBC text database),\n",
" * ``g_entry_heb`` feature (the vocalized lexeme as it is listed in the ETCBC lexicon),\n",
" * ``sp`` feature (part of speech),\n",
" * ``gloss`` feature.\n",
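"\n",
"A minimal sketch of one such record, with made-up example values, using Python's ``csv`` module to get the quoting right (the actual files are written by ``lexbase`` below):\n",
"\n",
"```python\n",
"import csv, io\n",
"\n",
"# hypothetical values for the lexeme 'W' (the conjunction 'and')\n",
"row = ('W', 51004, 'ו', 'וְ', 'conj', 'and')\n",
"buf = io.StringIO()\n",
"csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC).writerow(row)\n",
"print(buf.getvalue())  # \"W\",51004,\"ו\",\"וְ\",\"conj\",\"and\"\n",
"```\n",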
" \n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0.00s This is LAF-Fabric 4.5.0\n",
"API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n",
"Feature doc: http://shebanq-doc.readthedocs.org/en/latest/texts/welcome.html\n",
"\n"
]
}
],
"source": [
"import sys, collections, re\n",
"\n",
"from laf.fabric import LafFabric\n",
"from etcbc.preprocess import prepare\n",
"fabric = LafFabric()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0.00s LOADING API: please wait ... \n",
" 0.65s INFO: USING DATA COMPILED AT: 2015-05-04T13-46-20\n",
" 0.65s INFO: USING DATA COMPILED AT: 2015-05-04T14-07-34\n",
" 3.67s LOGFILE=/Users/dirk/SURFdrive/laf-fabric-output/etcbc4b/lexemes/__log__lexemes.txt\n",
" 14s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK lexemes AT 2015-05-27T15-43-40\n"
]
}
],
"source": [
"version = '4b'\n",
"fabric.load('etcbc{}'.format(version), 'lexicon', 'lexemes', {\n",
" \"xmlids\": {\"node\": False, \"edge\": False},\n",
" \"features\": ('''\n",
" otype\n",
" lex lex_utf8 g_entry_heb\n",
" sp gloss\n",
" book chapter verse\n",
" ''',''),\n",
" \"prepare\": prepare,\n",
" \"primary\": False,\n",
"})\n",
"exec(fabric.localnames.format(var='fabric'))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"csvdir = my_file('csv')\n",
"passagedir = my_file('passage')\n",
"%mkdir -p {csvdir}\n",
"%mkdir -p {passagedir}"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"# Passage syntax\n",
"\n",
"passages = '|'-separated list of passage\n",
"passage = bookname (chapterranges | (chapter : verseranges))\n",
"chapterranges = empty | (','-separated list of numberrange)\n",
"verseranges = empty | (','-separated list of numberrange)\n",
"numberrange = number | (number - number)\n"
]
},
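{
"cell_type": "markdown",
"metadata": {},
"source": [
"A self-contained sketch of how a ``numberrange`` spec such as ``1-3,5`` expands into a set of numbers (the real ``parse_ranges`` below also validates the numbers against the data; ``expand_ranges`` is just an illustrative name):\n",
"\n",
"```python\n",
"def expand_ranges(spec):\n",
"    # numbers are kept as strings, because chapter and verse features are string-valued\n",
"    numbers = set()\n",
"    for r in spec.split(','):\n",
"        comps = r.split('-', 1)\n",
"        (b, e) = (comps[0], comps[-1])\n",
"        if b.isdigit() and e.isdigit():\n",
"            numbers |= {str(c) for c in range(int(b), int(e) + 1)}\n",
"    return numbers\n",
"\n",
"print(sorted(expand_ranges('1-3,5'), key=int))  # ['1', '2', '3', '5']\n",
"```"
]
},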
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"passage_pat = re.compile(r'^\\s*([A-Za-z0-9_]+)\\s*([0-9,-]*)\\s*:?\\s*([0-9,-]*)\\s*$')\n",
"\n",
"lex_info = {}\n",
"lex_section = {}\n",
"lex_count = collections.Counter()\n",
"for v in F.otype.s('verse'):\n",
" bk = F.book.v(L.u('book', v))\n",
" ch = F.chapter.v(L.u('chapter', v))\n",
" vs = F.verse.v(v)\n",
" for w in L.d('word', v):\n",
" lex = F.lex.v(w)\n",
" if lex not in lex_info:\n",
" lex_info[lex] = (F.lex_utf8.v(w), F.g_entry_heb.v(w), F.sp.v(w), F.gloss.v(w))\n",
" lex_section.setdefault(bk, {}).setdefault(ch, {}).setdefault(vs, collections.Counter())[lex] += 1\n",
" lex_count[lex] += 1\n",
"\n",
"def verse_index():\n",
" result = {}\n",
" for v in F.verse.s():\n",
" bk = F.book.v(L.u('book', v))\n",
" ch = F.chapter.v(L.u('chapter', v))\n",
" vs = F.verse.v(v)\n",
" result.setdefault(bk, {}).setdefault(ch, {})[vs] = v\n",
" return result\n",
"\n",
"vindex = verse_index()\n",
"\n",
"def parse_passages(passages):\n",
" lexemes = set()\n",
" for p in passages.strip().split('|'):\n",
" lexemes |= parse_passage(p.strip())\n",
" return lexemes\n",
"\n",
"def parse_ranges(rangespec, kind, passage, source, subsources=None):\n",
" numbers = set()\n",
" if rangespec == '':\n",
" if subsources is None:\n",
" return set(source.keys())\n",
" else:\n",
" for subsource in subsources:\n",
" if subsource in source:\n",
" numbers |= set(source[subsource].keys())\n",
" return numbers\n",
" ranges = rangespec.split(',')\n",
" good = True\n",
" for r in ranges:\n",
" comps = r.split('-', 1)\n",
" if len(comps) == 1:\n",
" b = comps[0]\n",
" e = comps[0]\n",
" else:\n",
" (b,e) = comps\n",
" if not (b.isdigit() and e.isdigit()):\n",
" print('Error: Not a valid {} range: [{}] in [{}]'.format(kind, r, passage))\n",
" good = False\n",
" else:\n",
" b = int(b)\n",
" e = int(e)\n",
" for c in range(b, e+1):\n",
" crep = str(c)\n",
" if subsources is None:\n",
" if crep not in source:\n",
" print('Warning: No such {}: {} ([{}] in [{}])'.format(kind, crep, r, passage))\n",
" numbers.add(crep)\n",
" else:\n",
" for subsource in subsources:\n",
" if subsource not in source or crep not in source[subsource]:\n",
" print('Warning: No such {}: {}:{} ([{}] in [{}])'.format(kind, subsource, crep, r, passage))\n",
" numbers.add(crep)\n",
" return numbers\n",
" \n",
"def parse_passage(passage):\n",
" lexemes = set()\n",
" result = passage_pat.match(passage)\n",
" if result is None:\n",
" print('Error: Not a valid passage: {}'.format(passage))\n",
" return lexemes\n",
" (book, chapterspec, versespec) = result.group(1,2,3)\n",
" if book not in vindex:\n",
" print('Error: Not a valid book: {} in {}'.format(book, passage))\n",
" return lexemes\n",
" chapters = parse_ranges(chapterspec, 'chapter', passage, vindex[book])\n",
" verses = parse_ranges(versespec, 'verse', passage, vindex[book], chapters)\n",
"\n",
" vnodes = set()\n",
" for ch in vindex[book]:\n",
" if ch not in chapters: continue\n",
" for vs in vindex[book][ch]:\n",
" if vs not in verses: continue\n",
" vnodes.add(vindex[book][ch][vs])\n",
" lexemes = set()\n",
" for v in vnodes:\n",
" for w in L.d('word', v):\n",
" lexemes.add(F.lex.v(w))\n",
" return lexemes\n",
" \n",
"def lexbase(passages, excluded=None):\n",
" lexemes = parse_passages(passages)\n",
" outlexemes = set() if excluded is None else parse_passages(excluded)\n",
" lexemes -= outlexemes\n",
" fileid = '{}{}'.format(\n",
" passages, \n",
" '' if excluded is None else ' minus {}'.format(excluded)\n",
" )\n",
" filename = 'passage/{}.csv'.format(fileid.replace(':','_'))\n",
" of = outfile(filename)\n",
" i = 0\n",
" limit = 20\n",
" nlex = len(lexemes)\n",
" shown = min(nlex, limit)\n",
" print('==== {} ==== showing {} of {} lexemes here ===='.format(fileid, shown, nlex))\n",
" for lx in sorted(lexemes, key=lambda x: (-lex_count[x], x)):\n",
" (l_utf8, l_vc, l_sp, l_gl) = lex_info[lx]\n",
" line = '\"{}\",{},\"{}\",\"{}\",\"{}\",\"{}\"\\n'.format(lx, lex_count[lx], l_utf8, l_vc, l_sp, l_gl)\n",
" of.write(line)\n",
" if i < limit: sys.stdout.write(line)\n",
" i += 1\n",
" of.close()\n",
" print('See {}\\n'.format(my_file(filename)))"
]
},
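{
"cell_type": "markdown",
"metadata": {},
"source": [
"``lexbase`` sorts its output by descending corpus frequency, breaking ties alphabetically. A minimal illustration of that sort key on toy data:\n",
"\n",
"```python\n",
"import collections\n",
"\n",
"# 'A' and 'B' tie on frequency, so they fall back to alphabetical order\n",
"counts = collections.Counter({'B': 3, 'A': 3, 'C': 10})\n",
"print(sorted(counts, key=lambda x: (-counts[x], x)))  # ['C', 'A', 'B']\n",
"```"
]
},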
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Examples\n",
"\n",
"Here are some examples of the flexibility with which you can call the ``lexbase`` function."
]
},
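{
"cell_type": "markdown",
"metadata": {},
"source": [
"Passage strings are parsed with the ``passage_pat`` regular expression defined above; here is a quick standalone check (the pattern is copied) of how it splits a reference into book, chapter and verse parts:\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# the same pattern as in the lexbase code cell above\n",
"passage_pat = re.compile(r'^\\s*([A-Za-z0-9_]+)\\s*([0-9,-]*)\\s*:?\\s*([0-9,-]*)\\s*$')\n",
"\n",
"print(passage_pat.match('Genesis 2:4-7').group(1, 2, 3))  # ('Genesis', '2', '4-7')\n",
"print(passage_pat.match('Genesis 1-3').group(1, 2, 3))  # ('Genesis', '1-3', '')\n",
"```"
]
},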
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==== Genesis 2 ==== showing 20 of 131 lexemes here ====\n",
"\"W\",51004,\"ו\",\"וְ\",\"conj\",\"and\"\n",
"\"H\",30386,\"ה\",\"הַ\",\"art\",\"the\"\n",
"\"L\",20447,\"ל\",\"לְ\",\"prep\",\"to\"\n",
"\"B\",15767,\"ב\",\"בְּ\",\"prep\",\"in\"\n",
"\">T\",11017,\"את\",\"אֵת\",\"prep\",\"