{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Participles in the Hebrew Bible" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## About" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This task gives an inventory of particples and their context.\n", "It is based on a request by Janet Dyk for data by which the verbal/nominal roles of particples can be studied.\n", "\n", "It is work in progress. When we started we had not yet identified the exact set of features in the database that should give us clues.\n", "So, in this notebook you'll find a number of attempts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Firing up LAF-Fabric" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We fire up the engine, collect data, format the data and write it to a tab delimited file.\n", "\n", "Then the LAF-Fabric task is completed.\n", "\n", "After that we play around a bit with the data, see how we can visualize it with the python module *pandas*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Get a LAF processor" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s This is LAF-Fabric 4.3.3\n", "http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n" ] } ], "source": [ "import sys\n", "import collections\n", "from laf.fabric import LafFabric\n", "from etcbc.preprocess import prepare\n", "fabric = LafFabric()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load data for this task" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s LOADING API: please wait ... \n", " 0.00s INFO: USING DATA COMPILED AT: 2014-07-14T16-45-08\n", " 5.69s LOGFILE=/Users/dirk/laf-fabric-output/etcbc4/participle/__log__participle.txt\n", " 5.69s INFO: DATA LOADED FROM SOURCE etcbc4 AND ANNOX -- FOR TASK participle AT 2014-07-15T16-40-04\n" ] } ], "source": [ "fabric.load('etcbc4', '--', 'participle', {\n", " \"xmlids\": {\"node\": False, \"edge\": False},\n", " \"features\": ('''\n", " otype\n", " book chapter verse label\n", " g_word_utf8 g_cons_utf8\n", " vt\n", " sp pdp\n", " rela typ\n", " prs vbe\n", " g_prs\n", " ''','''\n", " functional_parent\n", " '''),\n", " \"primary\": False,\n", "})\n", "exec(fabric.localnames.format(var='fabric'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data exploration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we investigate the values that some features take in the database" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part of speech tags" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following pos-tags were found in the features ``sp`` and ``pdp``.\n", "\n", "For most word occurrences, these values coincide.\n", "Where they differ, we output both, separated by a ``~``." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/plain": [ "{'adjv',\n", " 'advb',\n", " 'art',\n", " 'conj',\n", " 'inrg',\n", " 'intj',\n", " 'nega',\n", " 'nmpr',\n", " 'prde',\n", " 'prep',\n", " 'prin',\n", " 'prps',\n", " 'subs',\n", " 'verb'}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "poss = set()\n", "for i in NN(test=F.otype.v, values=['word']):\n", " pos = F.sp.v(i)\n", " pdpos = F.pdp.v(i)\n", " poss.add(pos)\n", " poss.add(pdpos)\n", "poss" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Trees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Construct trees according to the parent relationship.\n", "Make indexes for\n", "\n", "* finding the sentence above each clause\n", "* finding the words in each clause" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 6.93s Top nodes found: 66045\n" ] } ], "source": [ "top_nodes = set(NN(test=F.otype.v, value='sentence'))\n", "msg(\"Top nodes found: {}\".format(len(top_nodes)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clause constituent relations" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 11s 10000 trees visited\n", " 12s 20000 trees visited\n", " 13s 30000 trees visited\n", " 14s 40000 trees visited\n", " 14s 50000 trees visited\n", " 15s 60000 trees visited\n", " 16s 66045 trees visited\n" ] } ], "source": [ "nodes_seen = set()\n", "to_sentence = {}\n", "clause_words = collections.defaultdict(lambda: set())\n", "sentence_words = collections.defaultdict(lambda: set())\n", "sentence_verse = {}\n", "verse_label = None\n", "\n", "def walk_tree(node, sentence, clause):\n", " if node in nodes_seen:\n", " return\n", " \n", " nodes_seen.add(node)\n", " to_sentence[node] = sentence\n", " new_clause = clause\n", "\n", " otype = F.otype.v(node)\n", " if otype == 'clause': new_clause = node\n", " if otype == 'word':\n", " clause_words[clause].add(node)\n", " sentence_words[sentence].add(node)\n", " \n", " children = Ci.functional_parent.v(node)\n", " for child in children:\n", " walk_tree(child, sentence, new_clause)\n", "\n", "s = 0\n", "sc = 0\n", "chunk = 10000\n", "\n", "for node in NN(top_nodes | set(NN(test=F.otype.v, value='verse'))):\n", " if F.otype.v(node) == 'verse':\n", " verse_label = F.label.v(node)\n", " continue\n", " sentence_verse[node] = verse_label\n", " nodes_seen = set()\n", " walk_tree(node, node, None)\n", " s += 1\n", " sc += 1\n", " if sc == chunk:\n", " msg(\"{} trees visited\".format(s))\n", " sc = 0\n", " \n", "msg(\"{} trees visited\".format(s))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PrAd: 1 x\n", "Subj: 436 x\n", "Objc: 1347 x\n", "Resu: 1193 x\n", "Adju: 5872 x\n", "RgRc: 198 x\n", "Cmpl: 241 x\n", "CoVo: 305 x\n", "Spec: 41 x\n", "Attr: 5930 x\n", "NA: 69400 x\n", "PreC: 156 x\n", "Coor: 2858 x\n", "\n", "\n", "Adju GEN 01,16 לְ הָאִ֖יר עַל הָ אָֽרֶץ\n", " וַ יִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּ רְקִ֣יעַ הַ שָּׁמָ֑יִם לְ הָאִ֖יר עַל הָ אָֽרֶץ וְ לִ מְשֹׁל֙ בַּ יֹּ֣ום וּ בַ לַּ֔יְלָה וּֽ לֲ הַבְדִּ֔יל בֵּ֥ין הָ אֹ֖ור וּ בֵ֣ין הַ חֹ֑שֶׁךְ\n", "Adju GEN 01,15 לְ הָאִ֖יר עַל הָ אָ֑רֶץ\n", " וְ הָי֤וּ לִ מְאֹורֹת֙ בִּ רְקִ֣יעַ הַ שָּׁמַ֔יִם לְ הָאִ֖יר עַל הָ אָ֑רֶץ\n", "Adju GEN 01,14 לְ הַבְדִּ֕יל בֵּ֥ין הַ יֹּ֖ום וּ בֵ֣ין הַ לָּ֑יְלָה\n", " יְהִ֤י מְאֹרֹת֙ בִּ רְקִ֣יעַ הַ שָּׁמַ֔יִם לְ הַבְדִּ֕יל בֵּ֥ין הַ יֹּ֖ום וּ בֵ֣ין הַ לָּ֑יְלָה\n", "Attr GEN 01,07 אֲשֶׁ֖ר מֵ עַ֣ל לָ רָקִ֑יעַ\n", " וַ יַּבְדֵּ֗ל בֵּ֤ין הַ מַּ֨יִם֙ אֲשֶׁר֙ מִ תַּ֣חַת לָ רָקִ֔יעַ וּ בֵ֣ין הַ מַּ֔יִם אֲשֶׁ֖ר מֵ עַ֣ל לָ רָקִ֑יעַ\n", "Attr GEN 01,11 מַזְרִ֣יעַ זֶ֔רַע\n", " תַּֽדְשֵׁ֤א הָ אָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב מַזְרִ֣יעַ זֶ֔רַע עֵ֣ץ פְּרִ֞י עֹ֤שֶׂה פְּרִי֙ לְ מִינֹ֔ו אֲשֶׁ֥ר זַרְעֹו בֹ֖ו עַל הָ אָ֑רֶץ\n", "Attr GEN 01,07 אֲשֶׁר֙ מִ תַּ֣חַת לָ רָקִ֔יעַ\n", " וַ יַּבְדֵּ֗ל בֵּ֤ין הַ מַּ֨יִם֙ אֲשֶׁר֙ מִ תַּ֣חַת לָ רָקִ֔יעַ וּ בֵ֣ין הַ מַּ֔יִם אֲשֶׁ֖ר מֵ עַ֣ל לָ רָקִ֑יעַ\n", "Cmpl GEN 13,16 לִ מְנֹות֙ אֶת עֲפַ֣ר הָ אָ֔רֶץ\n", " וְ שַׂמְתִּ֥י אֶֽת זַרְעֲךָ֖ כַּ עֲפַ֣ר הָ אָ֑רֶץ אֲשֶׁ֣ר אִם יוּכַ֣ל אִ֗ישׁ לִ מְנֹות֙ אֶת עֲפַ֣ר הָ אָ֔רֶץ גַּֽם זַרְעֲךָ֖ יִמָּנֶֽה\n", "Cmpl GEN 13,06 לָ שֶׁ֥בֶת יַחְדָּֽו\n", " וְ לֹ֥א יָֽכְל֖וּ לָ שֶׁ֥בֶת יַחְדָּֽו\n", "Cmpl GEN 04,07 אִם תֵּיטִיב֙\n", " הֲ לֹ֤וא אִם תֵּיטִיב֙ שְׂאֵ֔ת\n", "CoVo GEN 16,08 אֵֽי מִ זֶּ֥ה בָ֖את\n", " הָגָ֞ר שִׁפְחַ֥ת שָׂרַ֛י אֵֽי מִ זֶּ֥ה בָ֖את וְ אָ֣נָה תֵלֵ֑כִי\n", "CoVo GEN 15,02 מַה תִּתֶּן לִ֔י\n", " אֲדֹנָ֤י יֱהוִה֙ מַה תִּתֶּן לִ֔י וְ אָנֹכִ֖י הֹולֵ֣ךְ עֲרִירִ֑י\n", "CoVo GEN 23,11 שְׁמָעֵ֔נִי\n", " לֹֽא אֲדֹנִ֣י שְׁמָעֵ֔נִי\n", "Coor GEN 01,16 וְ לִ מְשֹׁל֙ בַּ יֹּ֣ום וּ בַ לַּ֔יְלָה\n", " וַ יִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּ רְקִ֣יעַ הַ שָּׁמָ֑יִם לְ הָאִ֖יר עַל הָ אָֽרֶץ וְ לִ מְשֹׁל֙ בַּ יֹּ֣ום וּ בַ לַּ֔יְלָה וּֽ לֲ הַבְדִּ֔יל בֵּ֥ין הָ אֹ֖ור וּ בֵ֣ין הַ חֹ֑שֶׁךְ\n", "Coor GEN 01,16 וּֽ לֲ הַבְדִּ֔יל בֵּ֥ין הָ אֹ֖ור וּ בֵ֣ין הַ חֹ֑שֶׁךְ\n", " וַ יִּתֵּ֥ן אֹתָ֛ם אֱלֹהִ֖ים בִּ רְקִ֣יעַ הַ שָּׁמָ֑יִם לְ הָאִ֖יר עַל הָ אָֽרֶץ וְ לִ מְשֹׁל֙ בַּ יֹּ֣ום וּ בַ לַּ֔יְלָה וּֽ לֲ הַבְדִּ֔יל בֵּ֥ין הָ אֹ֖ור וּ בֵ֣ין הַ חֹ֑שֶׁךְ\n", "Coor GEN 02,04 וְ כָל עֵ֥שֶׂב הַ שָּׂדֶ֖ה טֶ֣רֶם יִצְמָ֑ח\n", " בְּ יֹ֗ום עֲשֹׂ֛ות יְהוָ֥ה אֱלֹהִ֖ים אֶ֥רֶץ וְ שָׁמָֽיִם וְ כֹ֣ל שִׂ֣יחַ הַ שָּׂדֶ֗ה טֶ֚רֶם יִֽהְיֶ֣ה בָ אָ֔רֶץ וְ כָל עֵ֥שֶׂב הַ שָּׂדֶ֖ה טֶ֣רֶם יִצְמָ֑ח\n", "NA GEN 01,01 בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַ שָּׁמַ֖יִם וְ אֵ֥ת הָ אָֽרֶץ\n", " בְּ רֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַ שָּׁמַ֖יִם וְ אֵ֥ת הָ אָֽרֶץ\n", "NA GEN 01,02 וְ הָ אָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָ בֹ֔הוּ\n", " וְ הָ אָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָ בֹ֔הוּ\n", "NA GEN 01,02 וְ חֹ֖שֶׁךְ עַל פְּנֵ֣י תְהֹ֑ום\n", " וְ חֹ֖שֶׁךְ עַל פְּנֵ֣י תְהֹ֑ום\n", "Objc GEN 01,12 כִּי טֹֽוב\n", " וַ יַּ֥רְא אֱלֹהִ֖ים כִּי טֹֽוב\n", "Objc GEN 01,04 כִּי טֹ֑וב\n", " וַ יַּ֧רְא אֱלֹהִ֛ים אֶת הָ אֹ֖ור כִּי טֹ֑וב\n", "Objc GEN 01,10 כִּי טֹֽוב\n", " וַ יַּ֥רְא אֱלֹהִ֖ים כִּי טֹֽוב\n", "PrAd DAN 12,01 כָּת֥וּב בַּ סֵּֽפֶר\n", " וּ בָ עֵ֤ת הַ הִיא֙ יִמָּלֵ֣ט עַמְּךָ֔ כָּל הַ נִּמְצָ֖א כָּת֥וּב בַּ סֵּֽפֶר\n", "PreC GEN 02,13 הַ סֹּובֵ֔ב אֵ֖ת כָּל אֶ֥רֶץ כּֽוּשׁ\n", " ה֣וּא הַ סֹּובֵ֔ב אֵ֖ת כָּל אֶ֥רֶץ כּֽוּשׁ\n", "PreC GEN 02,11 הַ סֹּבֵ֗ב אֵ֚ת כָּל אֶ֣רֶץ הַֽ חֲוִילָ֔ה\n", " ה֣וּא הַ סֹּבֵ֗ב אֵ֚ת כָּל אֶ֣רֶץ הַֽ חֲוִילָ֔ה אֲשֶׁר שָׁ֖ם הַ זָּהָֽב\n", "PreC GEN 02,14 הַֽ הֹלֵ֖ךְ קִדְמַ֣ת אַשּׁ֑וּר\n", " ה֥וּא הַֽ הֹלֵ֖ךְ קִדְמַ֣ת אַשּׁ֑וּר\n", "Resu GEN 02,04 וְ כֹ֣ל שִׂ֣יחַ הַ שָּׂדֶ֗ה טֶ֚רֶם יִֽהְיֶ֣ה בָ אָ֔רֶץ\n", " בְּ יֹ֗ום עֲשֹׂ֛ות יְהוָ֥ה אֱלֹהִ֖ים אֶ֥רֶץ וְ שָׁמָֽיִם וְ כֹ֣ל שִׂ֣יחַ הַ שָּׂדֶ֗ה טֶ֚רֶם יִֽהְיֶ֣ה בָ אָ֔רֶץ וְ כָל עֵ֥שֶׂב הַ שָּׂדֶ֖ה טֶ֣רֶם יִצְמָ֑ח\n", "Resu GEN 02,14 ה֥וּא פְרָֽת\n", " וְ הַ נָּהָ֥ר הָֽ רְבִיעִ֖י ה֥וּא פְרָֽת\n", "Resu GEN 02,17 לֹ֥א תֹאכַ֖ל מִמֶּ֑נּוּ\n", " וּ מֵ עֵ֗ץ הַ דַּ֨עַת֙ טֹ֣וב וָ רָ֔ע לֹ֥א תֹאכַ֖ל מִמֶּ֑נּוּ\n", "RgRc GEN 02,17 אֲכָלְךָ֥ מִמֶּ֖נּוּ\n", " כִּ֗י בְּ יֹ֛ום אֲכָלְךָ֥ מִמֶּ֖נּוּ מֹ֥ות תָּמֽוּת\n", "RgRc GEN 02,04 עֲשֹׂ֛ות יְהוָ֥ה אֱלֹהִ֖ים אֶ֥רֶץ וְ שָׁמָֽיִם\n", " בְּ יֹ֗ום עֲשֹׂ֛ות יְהוָ֥ה אֱלֹהִ֖ים אֶ֥רֶץ וְ שָׁמָֽיִם וְ כֹ֣ל שִׂ֣יחַ הַ שָּׂדֶ֗ה טֶ֚רֶם יִֽהְיֶ֣ה בָ אָ֔רֶץ וְ כָל עֵ֥שֶׂב הַ שָּׂדֶ֖ה טֶ֣רֶם יִצְמָ֑ח\n", "RgRc GEN 03,05 אֲכָלְכֶ֣ם מִמֶּ֔נּוּ\n", " כִּ֚י יֹדֵ֣עַ אֱלֹהִ֔ים כִּ֗י בְּ יֹום֙ אֲכָלְכֶ֣ם מִמֶּ֔נּוּ וְ נִפְקְח֖וּ עֵֽינֵיכֶ֑ם וִ הְיִיתֶם֙ כֵּֽ אלֹהִ֔ים יֹדְעֵ֖י טֹ֥וב וָ רָֽע\n", "Spec GEN 48,07 לָ בֹ֣א אֶפְרָ֑תָה\n", " וַ אֲנִ֣י בְּ בֹאִ֣י מִ פַּדָּ֗ן מֵ֩תָה֩ עָלַ֨י רָחֵ֜ל בְּ אֶ֤רֶץ כְּנַ֨עַן֙ בַּ דֶּ֔רֶךְ בְּ עֹ֥וד כִּבְרַת אֶ֖רֶץ לָ בֹ֣א אֶפְרָ֑תָה\n", "Spec GEN 42,06 ה֥וּא\n", " וְ יֹוסֵ֗ף ה֚וּא הַ שַּׁלִּ֣יט עַל הָ אָ֔רֶץ ה֥וּא הַ מַּשְׁבִּ֖יר לְ כָל עַ֣ם הָ אָ֑רֶץ\n", "Spec EXO 35,09 לְ שָׁרֵ֣ת בַּ קֹּ֑דֶשׁ\n", " וְ כָל חֲכַם לֵ֖ב בָּכֶ֑ם יָבֹ֣אוּ וְ יַעֲשׂ֔וּ אֵ֛ת כָּל אֲשֶׁ֥ר צִוָּ֖ה יְהוָֽה אֶת הַ֨ מִּשְׁכָּ֔ן אֶֽת אָהֳלֹ֖ו וְ אֶת מִכְסֵ֑הוּ אֶת קְרָסָיו֙ וְ אֶת קְרָשָׁ֔יו אֶת ֯בריחו אֶת עַמֻּדָ֖יו וְ אֶת אֲדָנָֽיו אֶת הָ אָרֹ֥ן וְ אֶת בַּדָּ֖יו אֶת הַ כַּפֹּ֑רֶת וְ אֵ֖ת פָּרֹ֥כֶת הַ מָּסָֽךְ אֶת הַ שֻּׁלְחָ֥ן וְ אֶת בַּדָּ֖יו וְ אֶת כָּל כֵּלָ֑יו וְ אֵ֖ת לֶ֥חֶם הַ פָּנִֽים וְ אֶת מְנֹרַ֧ת הַ מָּאֹ֛ור וְ אֶת כֵּלֶ֖יהָ וְ אֶת נֵרֹתֶ֑יהָ וְ אֵ֖ת שֶׁ֥מֶן הַ מָּאֹֽור וְ אֶת מִזְבַּ֤ח הַ קְּטֹ֨רֶת֙ וְ אֶת בַּדָּ֔יו וְ אֵת֙ שֶׁ֣מֶן הַ מִּשְׁחָ֔ה וְ אֵ֖ת קְטֹ֣רֶת הַ סַּמִּ֑ים וְ אֶת מָסַ֥ךְ הַ פֶּ֖תַח לְ פֶ֥תַח הַ מִּשְׁכָּֽן אֵ֣ת מִזְבַּ֣ח הָ עֹלָ֗ה וְ אֶת מִכְבַּ֤ר הַ נְּחֹ֨שֶׁת֙ אֲשֶׁר לֹ֔ו אֶת בַּדָּ֖יו וְ אֶת כָּל כֵּלָ֑יו אֶת הַ כִּיֹּ֖ר וְ אֶת כַּנֹּֽו אֵ֚ת קַלְעֵ֣י הֶ חָצֵ֔ר אֶת עַמֻּדָ֖יו וְ אֶת אֲדָנֶ֑יהָ וְ אֵ֕ת מָסַ֖ךְ שַׁ֥עַר הֶ חָצֵֽר אֶת יִתְדֹ֧ת הַ מִּשְׁכָּ֛ן וְ אֶת יִתְדֹ֥ת הֶ חָצֵ֖ר וְ אֶת מֵיתְרֵיהֶֽם אֶת בִּגְדֵ֥י הַ שְּׂרָ֖ד לְ שָׁרֵ֣ת בַּ קֹּ֑דֶשׁ אֶת בִּגְדֵ֤י הַ קֹּ֨דֶשׁ֙ לְ אַהֲרֹ֣ן הַ כֹּהֵ֔ן וְ אֶת בִּגְדֵ֥י בָנָ֖יו לְ כַהֵֽן\n", "Subj GEN 24,50 דַּבֵּ֥ר אֵלֶ֖יךָ רַ֥ע אֹו טֹֽוב\n", " לֹ֥א נוּכַ֛ל דַּבֵּ֥ר אֵלֶ֖יךָ רַ֥ע אֹו טֹֽוב\n", "Subj GEN 02,18 הֱיֹ֥ות הָֽ אָדָ֖ם לְ בַדֹּ֑ו\n", " לֹא טֹ֛וב הֱיֹ֥ות הָֽ אָדָ֖ם לְ בַדֹּ֑ו\n", "Subj GEN 04,17 בֹּ֣נֶה עִ֔יר\n", " וַֽ יְהִי֙ בֹּ֣נֶה עִ֔יר\n" ] } ], "source": [ "rels = collections.defaultdict(int)\n", "verse_label = None\n", "nr_of_examples = 3\n", "examples = collections.defaultdict(lambda: set())\n", "for i in NN(test=F.otype.v, values=['clause', 'verse']):\n", " if F.otype.v(i) == 'verse':\n", " verse_label = F.label.v(i)\n", " else:\n", " ccr = F.rela.v(i)\n", " rels[ccr] += 1\n", " if len(examples[ccr]) < nr_of_examples:\n", " examples[ccr].add(i)\n", "for ccr in rels:\n", " print(\"{}: {:>6} x\".format(ccr, rels[ccr]))\n", "\n", "print(\"\\n\")\n", "\n", "for ccr in sorted(examples):\n", " for clause in examples[ccr]:\n", " sentence = to_sentence[clause]\n", " cwords = sorted(clause_words[clause])\n", " swords = sorted(sentence_words[sentence])\n", " vlabel = sentence_verse[sentence]\n", "# print(\"{} in {}: {}\".format(ccr, vlabel, \" \".join([x[1] for x in P.data(i)])))\n", " print(\"{:<4} {:<10} {}\\n{:<16}{}\".format(\n", " ccr, \n", " vlabel, \n", " \" \".join([F.g_word_utf8.v(word) for word in cwords]),\n", " '',\n", " \" \".join([F.g_word_utf8.v(word) for word in swords]),\n", " ))\n", " " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ccrs = {\n", " 'Adju': 'adjunct clause',\n", " 'Attr': 'attributive clause',\n", " 'Cmpl': 'complement clause, but not subject or object',\n", " 'CoVo': 'continuation of the vocative',\n", " 'Coor': 'coordination',\n", " 'Objc': 'object clause',\n", " 'PrAd': 'predicative adjunct clause',\n", " 'PreC': 'predicative complement clause',\n", " 'Resu': 'clause after resumptive extrapolated fronted element',\n", " 'RgRc': 'Regens rectum (governing governed)',\n", " 'Spec': 'Specification clause',\n", " 'Subj': 'subject clause',\n", " 'NA': 'not known/not marked',\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subject\n", "clause that has the function of subject\n", "\n", "### Object\n", "clause that has the function of object\n", "\n", "### Complement\n", "clause that has a function of a verb complement, but not subject or object \n", "\n", "### Attributive\n", "clause that has an attributive function (often with a relative pronoun)\n", "\n", "### Adjunct\n", "clauses with additional information, usually without a finite verb\n", "\n", "### Predicative clause\n", "clause that has a predicative function\n", "\n", "### Coordination\n", "multiple dependent clauses coordinated (with and, or etc) to each other under the same head (a main clause or a phrase (asher))\n", "\n", "### Continuation of the vocative\n", "clause that follows after a vocative: *Adam*, where are you.\n", "\n", "### Resumptive\n", "King David, Nathan the prohpet spoke severly to *him* [here King David is *casus pendens* or extrapolated element.\n", "\n", "### Regens Rectum\n", "You shall reign over the birds and the animals and **all** *creeps on the face of the earth* [Here **all** governs the *reptiles*]\n", "\n", "### none\n", "No clause constituent relation marked.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Phrase atom types" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'AdjP',\n", " 'AdvP',\n", " 'CP',\n", " 'DPrP',\n", " 'IPrP',\n", " 'InjP',\n", " 'InrP',\n", " 'NP',\n", " 'NegP',\n", " 'PP',\n", " 'PPrP',\n", " 'PrNP',\n", " 'VP'}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pats = set()\n", "for i in NN(test=F.otype.v, values=['phrase_atom']):\n", " pat = F.typ.v(i)\n", " pats.add(pat)\n", "pats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tense" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'NA', 'impf', 'impv', 'infa', 'infc', 'perf', 'ptca', 'ptcp', 'wayq'}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tenses = set()\n", "for i in NN(test=F.otype.v, values=['word']):\n", " tense = F.vt.v(i)\n", " tenses.add(tense)\n", "tenses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pronominal suffixes (paradigmatic, graphical, plain)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "paradigmatic pronoun suffix: H, H=, HJ, HM, HN, HW, HWN, J, K, K=, KM, KN, KWN, M, MW, N, N>, NJ, NW, W, absent, n/a\n", "graphical pronoun suffix: , +, +:@K@, +:AHOM, +:AHOWN, +:AK@, +:AK@H, +:AKEM, +:AKOM, +:AKOWN, +:H@, +:HEM, +:HEN, +:HOM, +:HOWM, +:HOWN, +:HW., +:K@, +:K@H, +:KEM, +:KEN, +:KOM, +:KOWN, +:NIJ, +:NW., +;>, +;H., +;HW., +;K, +;K:, +;K;H, +;KIJ, +;M, +;MOW, +;N.@H, +;NIJ, +;NW., +>, +@>, +@H, +@H,, +@H., +@H:N@H, +@H;M, +@H;N, +@HAM, +@HEM, +@HEN, +@HW., +@K:, +@K@H, +@KEM, +@KEN@H, +@M, +@MOW, +@N, +@N@H, +@NIJ, +@NW., +A, +AH., +AJ, +AM, +AN, +AN.IJ, +AN@>, +ANIJ, +D, +EH@, +EK:, +EK@, +EK@H, +EM, +EN@H, +H, +H., +H.EM, +H;M@H, +H;N, +H>, +H@, +HEM, +HEN, +HEN@H, +HIJ, +HM, +HOM, +HOWN, +HW, +HW., +HWN, +IJ, +IK, +IK:, +J, +JNJ, +K, +K.@, +K.@H, +K.EM, +K:, +K@, +K@H, +KEM, +KEN, +KEN@H, +KIJ, +KJ, +KM, +KOWN, +M, +MOW, +MW., +N, +N>, +N@>, +NH, +NIJ, +NJ, +NW, +NW., +OH, +OW, +W, +W., +WMW, +WNJ, +WW\n" ] } ], "source": [ "ppss = set()\n", "gpss = set()\n", "for i in NN(test=F.otype.v, values=['word']):\n", " pps = F.prs.v(i)\n", " gps = F.g_prs.v(i)\n", " ppss.add(pps)\n", " gpss.add(gps)\n", " \n", "output = \"paradigmatic pronoun suffix: \"\n", "output += \", \".join(sorted(ppss))\n", "output += \"\\n\" + \"graphical pronoun suffix: \"\n", "output += \", \".join(sorted(gpss))\n", "print(output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Verbal ending (paradigmatic)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/plain": [ "{'',\n", " 'H',\n", " 'H=',\n", " 'J',\n", " 'JN',\n", " 'N',\n", " 'N>',\n", " 'NH',\n", " 'NW',\n", " 'T',\n", " 'T=',\n", " 'T==',\n", " 'TJ',\n", " 'TM',\n", " 'TN',\n", " 'TWN',\n", " 'W',\n", " 'WN',\n", " 'n/a'}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pves = set()\n", "for i in NN(test=F.otype.v, value='word'):\n", " pve = F.vbe.v(i)\n", " pves.add(pve)\n", "pves" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conventions for all tasks" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pos_table = {\n", " 'adjv': 'aj',\n", " 'advb': 'av',\n", " 'art': 'dt',\n", " 'conj': 'cj',\n", " 'intj': 'ij',\n", " 'inrg': 'ir',\n", " 'nega': 'ng',\n", " 'subs': 'n',\n", " 'nmpr': 'n-pr',\n", " 'prep': 'pp',\n", " 'prps': 'pr-ps',\n", " 'prde': 'pr-dem',\n", " 'prin': 'pr-int',\n", " 'verb': 'vb',\n", "}\n", "\n", "pron_suffix_table = {\n", " '',\n", " 'H',\n", " 'H=',\n", " 'J',\n", " 'JN',\n", " 'N',\n", " 'N>',\n", " 'NH',\n", " 'NW',\n", " 'T',\n", " 'T=',\n", " 'T==',\n", " 'TJ',\n", " 'TM',\n", " 'TN',\n", " 'TWN',\n", " 'W',\n", " 'WN',\n", " 'n/a',\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Task: Participles in Clause Atoms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Specification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to analyse participles in their clause-*atoms*, not in their full clauses.\n", "We are particularly interested in verbal complements that these participles have in their clause-atom.\n", "\n", "Let us start with pronominal suffixes attached to the participle.\n", "\n", "We need to find all words marked with ``tense=ptca`` or ``tense=ptcp``.\n", "From there, we need all surrounding words in the same clause-*atom*.\n", "Of all words, we need the ``sp`` and the ``pdp`` features,\n", "and of the participle we need the ``prs`` as well.\n", "\n", "We output a tab delimited file.\n", "\n", "One row per participle, containing the following fields:\n", "\n", "sequence number | passage label | \n", "\n", "pos-tags of words before | pos tag of ptc | pronoun suffix (paradigmatic) | pos-tags of words after | \n", "\n", "plain text of words after | pronoun suffix (plain) | plain text of ptc | plain text of words before\n", "\n", "Every participle is shown within its clause-atom.\n", "\n", "If there are several participles in the same clause-atom, we put every participle in a separate row." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Execute the task: data collection" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 1m 10s Get the participles...\n", "Genesis (354)\n", "Exodus (340)\n", "Leviticus (232)\n", "Numeri (383)\n", "Deuteronomium (435)\n", "Josua (187)\n", "Judices (227)\n", "Samuel_I (325)\n", "Samuel_II (251)\n", "Reges_I (287)\n", "Reges_II (290)\n", "Jesaia (807)\n", "Jeremia (743)\n", "Ezechiel (505)\n", "Hosea (71)\n", "Joel (28)\n", "Amos (82)\n", "Obadia (7)\n", "Jona (15)\n", "Micha (72)\n", "Nahum (48)\n", "Habakuk (28)\n", "Zephania (44)\n", "Haggai (10)\n", "Sacharia (132)\n", "Maleachi (51)\n", "Psalmi (907)\n", "Iob (217)\n", "Proverbia (490)\n", "Ruth (34)\n", "Canticum (63)\n", "Ecclesiastes (126)\n", "Threni (58)\n", "Esther (108)\n", "Daniel (335)\n", "Esra (110)\n", "Nehemia (204)\n", "Chronica_I (193)\n", "Chronica_II (355)\n", " 1m 14s Found 9664 participles in 9154 clause atoms\n" ] } ], "source": [ "msg(\"Get the participles...\")\n", "\n", "book = None\n", "chapter = None\n", "verse = None\n", "label = None\n", "\n", "found_total = 0\n", "found_in_book = 0\n", "found_total = 0\n", "clause_atoms = []\n", "current_clause = []\n", "has_participle = None\n", "\n", "for i in NN(test=F.otype.v, values=['book', 'chapter', 'verse', 'clause_atom', 'word']):\n", " otype = F.otype.v(i)\n", " if otype == 'word':\n", " tense = F.vt.v(i)\n", " is_participle = tense == 'ptca' or tense == 'ptcp'\n", " pron_suff_para = None\n", " if is_participle:\n", " has_participle = True\n", " pron_suff_para = F.prs.v(i)\n", " found_total += 1\n", " pos = pos_table[F.sp.v(i)]\n", " pdpos = pos_table[F.pdp.v(i)]\n", " rpos = pos if pos == pdpos else \"{}~{}\".format(pos, pdpos)\n", " current_clause.append((\n", " is_participle,\n", " F.g_cons_utf8.v(i),\n", " rpos,\n", " pron_suff_para,\n", " ))\n", " elif otype == 'clause_atom':\n", " if has_participle:\n", " clause_atoms.append((label, current_clause))\n", " found_in_book += 1\n", " current_clause = []\n", " has_participle = False\n", " elif otype == 'book':\n", " if book != None:\n", " msg(\"{} ({})\".format(book, found_in_book), withtime=False)\n", " found_in_book = 0\n", " book = F.book.v(i)\n", " elif otype == 'chapter':\n", " chapter = F.chapter.v(i)\n", " elif otype == 'verse':\n", " verse = F.verse.v(i)\n", " label = \"{} {}:{}\".format(book, chapter, verse)\n", "if has_participle:\n", " clause_atoms.append((label, current_clause))\n", "msg(\"{} ({})\".format(book, found_in_book), withtime=False)\n", "\n", "msg(\"Found {} participles in {} clause atoms\".format(found_total, len(clause_atoms)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Execute the task: formatting output" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "split_clause_atoms = []\n", "for (label, clause) in clause_atoms:\n", " ptcs = [n for (n, w) in enumerate(clause) if w[0]]\n", " for ptc in ptcs:\n", " split_clause_atoms.append((\n", " label,\n", " clause[0:ptc],\n", " clause[ptc],\n", " clause[ptc+1:len(clause)] if ptc < len(clause) - 1 else [],\n", " ))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 1m 20s Results directory:\n", "/Users/dirk/laf-fabric-output/etcbc4/participle\n", "\n", "__log__participle.txt 1084 Tue Jul 15 18:41:24 2014\n", "ptc_cl_atoms.csv 825100 Tue Jul 15 18:41:24 2014\n" ] } ], "source": [ "ptc_cl_atoms = outfile(\"ptc_cl_atoms.csv\")\n", "ptc_cl_atoms.write(\"n\\tpassage\\tp_pre\\tp_ptc\\tp_suff\\tp_post\\tt_post\\tt_ptc\\tt_pre\\n\")\n", "for (n, (label, pre, ptc, post)) in enumerate(split_clause_atoms):\n", " fields = [str(n+1), label]\n", " fields.append(\"|\".join([w[2] for w in pre]))\n", " fields.append(ptc[2])\n", " fields.append(ptc[3])\n", " fields.append(\"|\".join([w[2] for w in post]))\n", " fields.append(\" \".join([w[1] for w in post]))\n", " fields.append(ptc[1])\n", " fields.append(\" \".join([w[1] for w in pre]))\n", " ptc_cl_atoms.write(\"{}\\n\".format(\"\\t\".join(fields)))\n", "close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Playing with the output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First of all: I opened the ``ptc_tab.csv`` (a tab delimited file) in OpenOffice, and there I formatted some rows and columns, defined a region, and sorted the rows. The result I saved in ``ptc_tab.ods`` (also on GitHub, same directory as this notebook).\n", "\n", "Let's get an impression of what we've got in our tab delimited file." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "import pandas\n", "from IPython.display import display\n", "pandas.set_option('display.notebook_repr_html', True)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
npassagep_prep_ptcp_suffp_postt_postt_ptct_pre
0 1 Genesis 1:3 cj|n|n vb absent pp|n|dt|n על פני ה מים מרחפת ו רוח אלהים
1 2 Genesis 1:7 cj|vb vb absent n~pp|n|pp|n בין מים ל מים מבדיל ו יהי
2 3 Genesis 1:11 vb absent n זרע מזריע
3 4 Genesis 1:11 vb absent n|pp|n פרי ל מינו עשׂה
4 5 Genesis 1:12 vb absent n|pp|n זרע ל מינהו מזריע
5 6 Genesis 1:12 vb absent n פרי עשׂה
6 7 Genesis 1:21 dt~cj vb absent רמשׂת ה
7 8 Genesis 1:27 dt~cj vb absent pp|dt|n על ה ארץ רמשׂ ה
8 9 Genesis 1:29 dt~cj vb absent pp|dt|n על ה ארץ רמשׂת ה
9 10 Genesis 1:29 vb absent n זרע זרע
\n", "

10 rows × 9 columns

\n", "
" ], "text/plain": [ " n passage p_pre p_ptc p_suff p_post t_post t_ptc \\\n", "0 1 Genesis 1:3 cj|n|n vb absent pp|n|dt|n על פני ה מים מרחפת \n", "1 2 Genesis 1:7 cj|vb vb absent n~pp|n|pp|n בין מים ל מים מבדיל \n", "2 3 Genesis 1:11 vb absent n זרע מזריע \n", "3 4 Genesis 1:11 vb absent n|pp|n פרי ל מינו עשׂה \n", "4 5 Genesis 1:12 vb absent n|pp|n זרע ל מינהו מזריע \n", "5 6 Genesis 1:12 vb absent n פרי עשׂה \n", "6 7 Genesis 1:21 dt~cj vb absent רמשׂת \n", "7 8 Genesis 1:27 dt~cj vb absent pp|dt|n על ה ארץ רמשׂ \n", "8 9 Genesis 1:29 dt~cj vb absent pp|dt|n על ה ארץ רמשׂת \n", "9 10 Genesis 1:29 vb absent n זרע זרע \n", "\n", " t_pre \n", "0 ו רוח אלהים \n", "1 ו יהי \n", "2 \n", "3 \n", "4 \n", "5 \n", "6 ה \n", "7 ה \n", "8 ה \n", "9 \n", "\n", "[10 rows x 9 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table_file = my_file('ptc_cl_atoms.csv')\n", "df = pandas.read_csv(table_file, sep=\"\\t\", keep_default_na=False, na_values=[])\n", "df.head(10)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }