{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Attributives\n", "\n", "We want to make a list of all nouns with their adjectival modifiers.\n", "We produce a tab separated file of phrases which contain a noun and adjectival modifiers.\n", "The columns are\n", "\n", "1. passage label\n", "1. phrase text\n", "1. phrase gloss\n", "1. head of an attributive subphrase\n", "1. attributive subphrase\n", "1. number of words in the head\n", "1. number of nouns in the head\n", "\n", "Hebrew text is represented in ETCBC consonantal transcription, for ease of importing it in Excel.\n", "It is not difficult to generate fully vocalized Hebrew, but then you need OpenOffice to open the csv file." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s This is LAF-Fabric 4.8.3\n", "API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n", "Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html\n", "\n" ] } ], "source": [ "import sys, os\n", "import collections\n", "\n", "import laf\n", "from laf.fabric import LafFabric\n", "from etcbc.preprocess import prepare\n", "fabric = LafFabric()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading the feature data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s LOADING API: please wait ... \n", " 0.01s DETAIL: COMPILING m: etcbc4b: UP TO DATE\n", " 0.01s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56\n", " 0.01s DETAIL: COMPILING a: lexicon: UP TO DATE\n", " 0.01s USING annox: lexicon DATA COMPILED AT: 2016-07-08T14-32-54\n", " 0.03s DETAIL: load main: G.node_anchor_min\n", " 0.12s DETAIL: load main: G.node_anchor_max\n", " 0.21s DETAIL: load main: G.node_sort\n", " 0.27s DETAIL: load main: G.node_sort_inv\n", " 0.65s DETAIL: load main: G.edges_from\n", " 0.71s DETAIL: load main: G.edges_to\n", " 0.77s DETAIL: load main: F.etcbc4_db_otype [node] \n", " 1.36s DETAIL: load main: F.etcbc4_ft_function [node] \n", " 1.47s DETAIL: load main: F.etcbc4_ft_g_word_utf8 [node] \n", " 1.74s DETAIL: load main: F.etcbc4_ft_number [node] \n", " 2.61s DETAIL: load main: F.etcbc4_ft_rela [node] \n", " 3.15s DETAIL: load main: F.etcbc4_ft_sp [node] \n", " 3.49s DETAIL: load main: F.etcbc4_ft_trailer_utf8 [node] \n", " 3.75s DETAIL: load main: F.etcbc4_sft_book [node] \n", " 3.78s DETAIL: load main: F.etcbc4_sft_chapter [node] \n", " 3.80s DETAIL: load main: F.etcbc4_sft_verse [node] \n", " 3.81s DETAIL: load main: F.etcbc4_ft_mother [e] \n", " 3.93s DETAIL: load main: C.etcbc4_ft_mother -> \n", " 4.28s DETAIL: load main: C.etcbc4_ft_mother <- \n", " 4.41s DETAIL: load annox lexicon: F.etcbc4_lex_gloss [node] \n", " 4.61s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4b/adjectives/__log__adjectives.txt\n", " 4.63s INFO: LOADING PREPARED data: please wait ... \n", " 4.63s prep prep: G.node_sort\n", " 4.69s prep prep: G.node_sort_inv\n", " 5.12s prep prep: L.node_up\n", " 7.93s prep prep: L.node_down\n", " 13s prep prep: V.verses\n", " 13s prep prep: V.books_la\n", " 13s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html\n", " 15s INFO: LOADED PREPARED data\n", " 15s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK adjectives AT 2016-11-18T18-41-00\n" ] } ], "source": [ "version = '4b'\n", "API = fabric.load('etcbc{}'.format(version), 'lexicon', 'adjectives', {\n", " \"xmlids\": {\"node\": False, \"edge\": False},\n", " \"features\": ('''\n", " otype \n", " function rela sp\n", " gloss\n", " g_word_utf8 trailer_utf8\n", " book chapter verse number\n", " ''',\n", " '''\n", " mother\n", " '''),\n", " \"prepare\": prepare,\n", " \"primary\": False,\n", "}, verbose='DETAIL')\n", "exec(fabric.localnames.format(var='fabric'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Collect data\n", "\n", "We need phrases that act as a mother to one or more attributive subphrases.\n", "That means that such a subphrase must have the \n", "[rela](https://shebanq.ancient-data.org/shebanq/static/docs/featuredoc/features/comments/rela.html)\n", "feature set to `atr`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us first collect subphrases having `rela = atr`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 7.65s Finding subphrases ...\n", " 8.88s 3106 attributive subphrases\n" ] } ], "source": [ "attr_subphrases = set()\n", "inf('Finding subphrases ...')\n", "for s in F.otype.s('subphrase'):\n", " if F.rela.v(s) != 'atr':\n", " continue\n", " attr_subphrases.add(s)\n", "inf('{} attributive subphrases'.format(len(attr_subphrases)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let us add the mothers to those subphrases.\n", "If there is no mother, we leave it out.\n", "A subphrase should not have multiple mothers, but we'll check that anyway." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 15s No subphrases with multiple mothers\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " 15s 12 subphrases without mothers\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " 15s 3094 attributive subphrases with a single mother\n" ] } ], "source": [ "attr_subphrase_mother = dict()\n", "multiple_mothers = set()\n", "no_mothers = set()\n", "for s in attr_subphrases:\n", " mothers = list(C.mother.v(s))\n", " if len(mothers) == 0:\n", " no_mothers.add(s)\n", " continue\n", " if len(mothers) > 1: \n", " multiple_mothers.add(s)\n", " continue\n", " attr_subphrase_mother[s] = mothers[0]\n", "if len(multiple_mothers):\n", " msg('{} subphrases with multiple mothers'.format(len(multiple_mothers)))\n", "else:\n", " inf('No subphrases with multiple mothers')\n", "if len(no_mothers):\n", " msg('{} subphrases without mothers'.format(len(no_mothers)))\n", "else:\n", " inf('No subphrases without mothers')\n", "\n", "inf('{} attributive subphrases with a single mother'.format(len(attr_subphrase_mother)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us get some information about the mothers of those subphrases.\n", "What kind of objects are they?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3094 subphrases with a mother of type subphrase\n" ] } ], "source": [ "mother_types = collections.Counter()\n", "idents = 0\n", "for (s, m) in attr_subphrase_mother.items():\n", " mother_types[F.otype.v(m)] +=1\n", "\n", "for t in sorted(mother_types):\n", " print('{:>4} subphrases with a mother of type {}'.format(mother_types[t], t))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the mother is always a subphrase.\n", "What about the length of that subphrase?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2085 subphrases with a mother of length 1\n", " 919 subphrases with a mother of length 2\n", " 62 subphrases with a mother of length 3\n", " 14 subphrases with a mother of length 4\n", " 11 subphrases with a mother of length 5\n", " 1 subphrases with a mother of length 7\n", " 1 subphrases with a mother of length 8\n", " 1 subphrases with a mother of length 9\n" ] } ], "source": [ "mother_length = collections.Counter()\n", "for (s, m) in attr_subphrase_mother.items():\n", " mother_length[len(L.d('word', m))] +=1\n", "\n", "for t in sorted(mother_length):\n", " print('{:>4} subphrases with a mother of length {:>2}'.format(mother_length[t], t))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many nouns has the mother?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 63 subphrases with a mother having 0 nouns\n", "2867 subphrases with a mother having 1 nouns\n", " 137 subphrases with a mother having 2 nouns\n", " 12 subphrases with a mother having 3 nouns\n", " 6 subphrases with a mother having 4 nouns\n", " 8 subphrases with a mother having 5 nouns\n", " 1 subphrases with a mother having 6 nouns\n" ] } ], "source": [ "mother_nouns = collections.Counter()\n", "for (s, m) in attr_subphrase_mother.items():\n", " mother_nouns[len([w for w in L.d('word', m) if F.sp.v(w) == 'subs'])] +=1\n", "\n", "for t in sorted(mother_nouns):\n", " print('{:>4} subphrases with a mother having {:>2} nouns'.format(mother_nouns[t], t))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating output\n", "\n", "Let us now assemble all data into the final output.\n", "We produce also a row of column headers." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "fields = '''\n", " passage\n", " phrase_text\n", " phrase_gloss\n", " head\n", " attributive\n", " #words_mother\n", " #nouns_mother\n", "'''.strip().split()\n", "nfields = len(fields)\n", "row_template = ('{}\\t' * (nfields - 1))+'{}\\n'" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 25s Written 3095 lines to attributives_ec.csv\n", " 25s Written 3095 lines to attributives_ha.csv\n" ] } ], "source": [ "of_path_template = 'attributives_{}.csv'\n", "for fmt in ['ec', 'ha']:\n", " of = open(of_path_template.format(fmt), 'w')\n", " of.write('{}\\n'.format('\\t'.join(fields)))\n", " for s in sorted(attr_subphrase_mother, key=NK):\n", " sw = list(L.d('word', s))\n", " p = L.u('phrase', s)\n", " pw = list(L.d('word', p))\n", " m = attr_subphrase_mother[s]\n", " mw = list(L.d('word', m))\n", "\n", " of.write(row_template.format(\n", " T.passage(s),\n", " T.words(pw, fmt=fmt).replace('\\n', ' '),\n", " ' '.join(F.gloss.v(w) for w in pw),\n", " T.words(mw, fmt=fmt).replace('\\n', ' '),\n", " T.words(sw, fmt=fmt).replace('\\n', ' '),\n", " len(mw),\n", " len([w for w in mw if F.sp.v(w) == 'subs']),\n", " ))\n", "\n", " of.close()\n", " inf('Written {} lines to {}'.format(len(attr_subphrase_mother) + 1, of_path_template.format(fmt)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Results\n", "[etcbc consonantal](attributives_ec.csv)\n", "and\n", "[fully pointed hebrew](attributives_ha.csv).\n", "\n", "Screenshot made in the Numbers program:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "passage\tphrase_text\tphrase_gloss\thead\tattributive\t#words_mother\t#nouns_mother\n", "Genesis 1:8\tJWM #NJ00 \tday second\tJWM \t#NJ00 \t1\t1\n", "Genesis 1:13\tJWM #LJ#J00 \tday third\tJWM \t#LJ#J00 \t1\t1\n", "Genesis 1:16\t>T&#NJ HM>RT HGDLJM \t two the lamp the great\tHM>RT \tHGDLJM \t2\t1\n", "Genesis 1:16\t>T&HM>WR HGDL LMM#LT HJWM \t the lamp the great to dominion the day\tHM>WR \tHGDL \t2\t1\n", "Genesis 1:16\t>T&HM>WR HQVN LMM#LT HLJLH \t the lamp the small to dominion the night\tHM>WR \tHQVN \t2\t1\n", "Genesis 1:19\tJWM RBJT&HTNJNM HGDLJM W>T KL&NP# \t the sea-monster the great and whole soul\tHTNJNM \tHGDLJM \t2\t1\n", "Genesis 1:23\tJWM XMJ#J00 \tday fifth\tJWM \tXMJ#J00 \t1\t1\n", "Genesis 1:24\tNP# XJH \tsoul alive\tNP# \tXJH \t1\t1\n", "Genesis 1:30\tNP# XJH \tsoul alive\tNP# \tXJH \t1\t1\n", "Genesis 2:2\tBJWM H#BJ