{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Verbal valence\n",
"\n",
"*Verbal valence* is a kind of signature of a verb, not unlike overloading in programming languages.\n",
"The meaning of a verb depends on the number and kind of its complements, i.e. the linguistic entities that act as arguments for the semantic function of the verb.\n",
"\n",
"We will use a set of flowcharts, composed by Janet Dyk, to specify and compute the sense of a verb in specific contexts, depending on the verbal valence. Although the flowcharts are not difficult to understand, it takes a good deal of ingenuity to apply them to all the real-world situations that we encounter in our corpus.\n",
"\n",
"Read more in the [wiki](https://github.com/ETCBC/valence/wiki)."
]
},
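{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the analogy with overloading concrete, here is a purely illustrative Python sketch (the readings are invented, not taken from the actual flowcharts): one function whose meaning shifts with the number and kind of its arguments, just as a verb's sense shifts with its complements.\n",
"\n",
"```python\n",
"def give(subject, direct_object=None, indirect_object=None):\n",
"    # no complements: absolute reading\n",
"    if direct_object is None and indirect_object is None:\n",
"        return f\"{subject} gives\"\n",
"    # direct object only: simple transitive reading\n",
"    if indirect_object is None:\n",
"        return f\"{subject} gives {direct_object}\"\n",
"    # direct + indirect object: full ditransitive reading\n",
"    return f\"{subject} gives {direct_object} to {indirect_object}\"\n",
"```"
]
},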
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pipeline\n",
"See [operation](https://github.com/ETCBC/pipeline/blob/master/README.md#operation)\n",
"for how to run this script in the pipeline."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"import sys\n",
"import os\n",
"import collections\n",
"import yaml\n",
"from copy import deepcopy\n",
"import utils\n",
"from tf.fabric import Fabric\n",
"from tf.core.helpers import formatMeta"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"if \"SCRIPT\" not in locals():\n",
" SCRIPT = False\n",
" FORCE = True\n",
" CORE_NAME = \"bhsa\"\n",
" NAME = \"valence\"\n",
" VERSION = \"c\"\n",
" CORE_MODULE = \"core\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"def stop(good=False):\n",
" if SCRIPT:\n",
" sys.exit(0 if good else 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Authors\n",
"\n",
"[Janet Dyk and Dirk Roorda](https://github.com/ETCBC/valence/wiki/Authors)\n",
"\n",
"Last modified 2017-09-13."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"[References](https://github.com/ETCBC/valence/wiki/References)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"We have carried out the valence project against the Hebrew Text Database of the BHSA, version `4b`.\n",
"See the description of the [sources](https://github.com/ETCBC/valence/wiki/Sources).\n",
"\n",
"However, we can also run our pipeline against the newer versions.\n",
"\n",
"We also make use of corrected and enriched data delivered by the\n",
"[enrich notebook](enrich.ipynb).\n",
"The features of that data module are specified\n",
"[here](https://github.com/ETCBC/valence/wiki/Data)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Results\n",
"\n",
"We produce a Text-Fabric feature `sense` with the sense labels per verb occurrence, and add\n",
"this to the *valence* data module created in the\n",
"[enrich](enrich.ipynb) notebook.\n",
"\n",
"We also show the results in\n",
"[SHEBANQ](https://shebanq.ancient-data.org), the website of the ETCBC that exposes its Hebrew Text Database in such a way\n",
"that users can query it, save their queries, add manual annotations and even upload bulk sets of generated annotations.\n",
"That is exactly what we do: the valence results are visible in SHEBANQ in the notes view, so that every outcome can be viewed in context."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Flowchart logic\n",
"\n",
"Valence flowchart logic translates the verb context into a label that is characteristic of the context.\n",
"You could say it is a fingerprint of the context.\n",
"Verb meanings are complex and depend on context. It turns out that we can organize\n",
"the meaning selection of verbs around these fingerprints.\n",
"\n",
"For each verb, we can specify a *flowchart* as a mapping of fingerprints to concrete meanings.\n",
"We have flowcharts for a limited, but open set of verbs.\n",
"They are listed in the\n",
"[wiki](https://github.com/ETCBC/valence/wiki),\n",
"and will be referred to from the resulting valence annotations in SHEBANQ.\n",
"\n",
"For each verb, the flowchart is represented as a mapping of *sense labels* to meaning templates.\n",
"A sense label is a code for the presence and nature of the direct objects and complements in the context.\n",
"See the [legend](https://github.com/ETCBC/valence/wiki/Legend) of sense labels.\n",
"\n",
"The interesting part is the *sense template*,\n",
"which consists of a translation text augmented with placeholders for the direct objects and complements.\n",
"\n",
"See for example the flowchart of [NTN](https://github.com/ETCBC/valence/wiki/FC_NTN).\n",
"\n",
"* `{verb}` the verb occurrence in question\n",
"* `{pdos}` principal direct objects (phrase)\n",
"* `{kdos}` K-objects (phrase)\n",
"* `{ldos}` L-objects (phrase)\n",
"* `{ndos}` direct objects (phrase) (none of the above)\n",
"* `{idos}` infinitive construct (clause) objects\n",
"* `{cdos}` direct objects (clause) (none of the above)\n",
"* `{inds}` indirect objects\n",
"* `{bens}` benefactive adjuncts\n",
"* `{locs}` locatives\n",
"* `{cpls}` complements, not marked as either indirect object or locative\n",
"\n",
"In case there are multiple entities, the algorithm returns them chunked as phrases/clauses.\n",
"\n",
"Apart from the template, there is also a *status* and an optional *account*.\n",
"\n",
"The status is ``!`` in normal cases, ``?`` in dubious cases, and ``-`` in erroneous cases.\n",
"In SHEBANQ these statuses are translated into `colors` of the notes (blue/orange/red).\n",
"\n",
"The account contains information about the grounds on which the algorithm has arrived at its conclusions."
]
},
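{
"cell_type": "markdown",
"metadata": {},
"source": [
"In code, such a flowchart boils down to a mapping from sense labels to a status plus a sense template, which is then filled with the constituents found in the clause. The labels and templates below are invented for illustration and do not reproduce any actual flowchart:\n",
"\n",
"```python\n",
"# hypothetical mini flowchart: sense label -> (status, sense template)\n",
"flowchart = {\n",
"    \"d-\": (\"!\", \"{verb} [give] {pdos}\"),\n",
"    \"di\": (\"!\", \"{verb} [give] {pdos} to {inds}\"),\n",
"    \"--\": (\"?\", \"{verb} [give?]\"),\n",
"}\n",
"\n",
"def apply_flowchart(label, **constituents):\n",
"    (status, template) = flowchart.get(label, (\"-\", None))\n",
"    if template is None:\n",
"        return (status, None)\n",
"    return (status, template.format(**constituents))\n",
"```"
]
},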
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"senses = set(\n",
" \"\"\"\n",
"\n",
"CJT\n",
"DBQ\n",
"FJM\n",
"NTN\n",
"QR>\n",
"ZQN\n",
"\"\"\".strip().split()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"senseLabels = \"\"\"\n",
"--\n",
"-i\n",
"-b\n",
"-p\n",
"-c\n",
"d-\n",
"di\n",
"db\n",
"dp\n",
"dc\n",
"n.\n",
"l.\n",
"k.\n",
"i.\n",
"c.\n",
"\"\"\".strip().split()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"constKindSpecs = \"\"\"\n",
"verb:verb\n",
"dos:direct object\n",
"pdos:principal direct object\n",
"kdos:K-object\n",
"ldos:L-object\n",
"ndos:NP-object\n",
"idos:infinitive object clause\n",
"cdos:direct object clause\n",
"inds:indirect object\n",
"bens:benefactive\n",
"locs:locative\n",
"cpls:complement\n",
"\"\"\".strip().split(\n",
" \"\\n\"\n",
")"
]
},
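{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each entry in `constKindSpecs` is a `code:label` pair; a minimal sketch of turning such lines into a lookup table (assuming the same format as above):\n",
"\n",
"```python\n",
"constKindSpecs = \"\"\"\n",
"verb:verb\n",
"dos:direct object\n",
"pdos:principal direct object\n",
"\"\"\".strip().split(\"\\n\")\n",
"\n",
"# split each spec on the first colon: code -> human-readable label\n",
"constKind = dict(spec.split(\":\", 1) for spec in constKindSpecs)\n",
"```"
]
},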
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Results\n",
"\n",
"The complete set of results is in SHEBANQ.\n",
"It is the note set\n",
"[valence](https://shebanq.ancient-data.org/hebrew/note?version=4b&id=Mnx2YWxlbmNl&tp=txt_tb1&nget=v)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Firing up the engines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setting up the context: source file and target directories\n",
"\n",
"The conversion is executed in an environment of directories, so that sources, temp files and\n",
"results are in convenient places and do not have to be shifted around."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"repoBase = os.path.expanduser(\"~/github/etcbc\")\n",
"coreRepo = \"{}/{}\".format(repoBase, CORE_NAME)\n",
"thisRepo = \"{}/{}\".format(repoBase, NAME)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"coreTf = \"{}/tf/{}\".format(coreRepo, VERSION)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"thisSource = \"{}/source/{}\".format(thisRepo, VERSION)\n",
"thisTemp = \"{}/_temp/{}\".format(thisRepo, VERSION)\n",
"thisTempTf = \"{}/tf\".format(thisTemp)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"thisTf = \"{}/tf/{}\".format(thisRepo, VERSION)\n",
"thisNotes = \"{}/shebanq/{}\".format(thisRepo, VERSION)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"notesFile = \"valenceNotes.csv\"\n",
"flowchartBase = \"https://github.com/ETCBC/valence/wiki\""
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"if not os.path.exists(thisNotes):\n",
" os.makedirs(thisNotes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Test\n",
"\n",
"Check whether this conversion is needed in the first place.\n",
"This check is only performed when run as a script."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"if SCRIPT:\n",
" (good, work) = utils.mustRun(\n",
" None, \"{}/.tf/{}.tfx\".format(thisTf, \"sense\"), force=FORCE\n",
" )\n",
" if not good:\n",
" stop(good=False)\n",
" if not work:\n",
" stop(good=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading the feature data\n",
"\n",
"We load the features we need from the BHSA core database and from the valence module,\n",
"as far as generated by the\n",
"[enrich](https://github.com/ETCBC/valence/blob/master/programs/enrich.ipynb) notebook."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"..............................................................................................\n",
". 0.00s Load the existing TF dataset .\n",
"..............................................................................................\n",
"This is Text-Fabric 9.2.0\n",
"Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n",
"\n",
"124 features found and 0 ignored\n"
]
}
],
"source": [
"utils.caption(4, \"Load the existing TF dataset\")\n",
"TF = Fabric(locations=[coreTf, thisTf], modules=[\"\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We instruct the API to load data."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1.44s Dataset without structure sections in otext:no structure functions in the T-API\n",
" | | 1.17s C __characters__ from otext\n",
" | | 1.04s T f_correction from ~/github/etcbc/valence/tf/c\n",
" | | 1.16s T grammatical from ~/github/etcbc/valence/tf/c\n",
" | | 1.06s T lexical from ~/github/etcbc/valence/tf/c\n",
" | | 1.01s T original from ~/github/etcbc/valence/tf/c\n",
" | | 1.19s T predication from ~/github/etcbc/valence/tf/c\n",
" | | 1.03s T s_manual from ~/github/etcbc/valence/tf/c\n",
" | | 1.08s T semantic from ~/github/etcbc/valence/tf/c\n",
" | | 1.17s T valence from ~/github/etcbc/valence/tf/c\n",
" 21s All features loaded/computed - for details use TF.isLoaded()\n"
]
},
{
"data": {
"text/plain": [
"[('Computed',\n",
" 'computed-data',\n",
" ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),\n",
" ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),\n",
" ('Fabric', 'loading', ('TF',)),\n",
" ('Locality', 'locality', ('L Locality',)),\n",
" ('Nodes', 'navigating-nodes', ('N Nodes',)),\n",
" ('Features',\n",
" 'node-features',\n",
" ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),\n",
" ('Search', 'search', ('S Search',)),\n",
" ('Text', 'text', ('T Text',))]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"api = TF.load(\n",
" \"\"\"\n",
" function rela typ\n",
" g_word_utf8 trailer_utf8\n",
" lex prs uvf sp pdp ls vs vt nametype gloss\n",
" book chapter verse label number\n",
" s_manual f_correction\n",
" valence predication grammatical original lexical semantic\n",
" mother\n",
"\"\"\"\n",
")\n",
"api.makeAvailableIn(globals())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Indicators\n",
"\n",
"Here we specify by which features we recognize key constituents.\n",
"We predominantly use features that come from the correction/enrichment workflow."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* `pf_`... : predication feature\n",
"* `gf_`... : grammatical feature\n",
"* `vf_`... : valence feature\n",
"* `sf_`... : semantic feature\n",
"* `of_`... : original feature"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"pf_predicate = {\n",
" \"regular\",\n",
"}\n",
"gf_direct_object = {\n",
" \"principal_direct_object\",\n",
" \"NP_direct_object\",\n",
" \"direct_object\",\n",
" \"L_object\",\n",
" \"K_object\",\n",
" \"infinitive_object\",\n",
"}\n",
"gf_indirect_object = {\n",
" \"indirect_object\",\n",
"}\n",
"gf_complement = {\n",
" \"*\",\n",
"}\n",
"sf_locative = {\n",
" \"location\",\n",
"}\n",
"sf_benefactive = {\n",
" \"benefactive\",\n",
"}\n",
"vf_locative = {\n",
" \"complement\",\n",
" \"adjunct\",\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"verbal_stems = set(\n",
" \"\"\"\n",
" qal\n",
"\"\"\".strip().split()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pronominal suffixes\n",
"We collect the information to determine how to render pronominal suffixes on words.\n",
"On verbs, they must be rendered *accusatively*, like `see him`.\n",
"But on nouns, they must be rendered *genitively*, like `hand my`.\n",
"So we make an inventory of part of speech types and the pronominal suffixes that occur on them.\n",
"On that basis we make the translation dictionaries `pronominal_suffix` and `switch_prs`.\n",
"\n",
"Finally, we define a function `get_prs_info` that delivers for each word the pronominal suffix info and gloss,\n",
"if there is any, and `(None, None)` otherwise."
]
},
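{
"cell_type": "markdown",
"metadata": {},
"source": [
"The distinction can be shown in miniature with the suffix `W`, which occurs both on verbs and on nouns (these entries are taken from the full dictionary defined below):\n",
"\n",
"```python\n",
"pronominal_suffix = {\n",
"    \"accusative\": {\"W\": (\"p3-sg-m\", \"him\")},  # on a verb: see him\n",
"    \"genitive\": {\"W\": (\"p3-sg-m\", \"his\")},  # on a noun: hand his\n",
"}\n",
"```"
]
},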
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"adjv H : 16\n",
"adjv HM : 10\n",
"adjv J : 25\n",
"adjv K : 35\n",
"adjv K= : 3\n",
"adjv KM : 7\n",
"adjv M : 8\n",
"adjv MW : 1\n",
"adjv NW : 5\n",
"adjv W : 59\n",
"adjv absent : 9273\n",
"advb n/a : 4550\n",
"art n/a : 30386\n",
"conj n/a : 62722\n",
"inrg K : 1\n",
"inrg M : 2\n",
"inrg W : 5\n",
"inrg absent : 1277\n",
"intj K : 13\n",
"intj K= : 7\n",
"intj KM : 2\n",
"intj M : 37\n",
"intj NJ : 181\n",
"intj NW : 8\n",
"intj W : 3\n",
"intj absent : 1634\n",
"nega n/a : 6053\n",
"nmpr n/a : 33081\n",
"prde n/a : 2660\n",
"prep H : 1019\n",
"prep H= : 36\n",
"prep HJ : 13\n",
"prep HM : 1499\n",
"prep HN : 74\n",
"prep HW : 174\n",
"prep HWN : 19\n",
"prep J : 1853\n",
"prep K : 1634\n",
"prep K= : 353\n",
"prep KM : 1181\n",
"prep KN : 2\n",
"prep KWN : 1\n",
"prep M : 684\n",
"prep MW : 68\n",
"prep N : 3\n",
"prep N> : 4\n",
"prep NJ : 105\n",
"prep NW : 539\n",
"prep W : 3247\n",
"prep absent : 60765\n",
"prin n/a : 1021\n",
"prps n/a : 5011\n",
"subs H : 1635\n",
"subs H= : 108\n",
"subs HJ : 58\n",
"subs HM : 1417\n",
"subs HN : 114\n",
"subs HW : 340\n",
"subs HWN : 32\n",
"subs J : 4332\n",
"subs K : 4362\n",
"subs K= : 744\n",
"subs KM : 1335\n",
"subs KN : 16\n",
"subs KWN : 7\n",
"subs M : 1919\n",
"subs MW : 25\n",
"subs N : 29\n",
"subs N> : 3\n",
"subs NJ : 19\n",
"subs NW : 809\n",
"subs W : 7653\n",
"subs absent : 96548\n",
"verb H : 682\n",
"verb H= : 17\n",
"verb HJ : 6\n",
"verb HM : 121\n",
"verb HN : 4\n",
"verb HW : 1097\n",
"verb J : 356\n",
"verb K : 1089\n",
"verb K= : 201\n",
"verb KM : 132\n",
"verb KN : 1\n",
"verb KWN : 2\n",
"verb M : 1288\n",
"verb MW : 23\n",
"verb N : 15\n",
"verb N> : 3\n",
"verb NJ : 1016\n",
"verb NW : 274\n",
"verb W : 938\n",
"verb absent : 66445\n"
]
}
],
"source": [
"prss = collections.defaultdict(lambda: collections.defaultdict(lambda: 0))\n",
"for w in F.otype.s(\"word\"):\n",
" prss[F.sp.v(w)][F.prs.v(w)] += 1\n",
"if not SCRIPT:\n",
" for sp in sorted(prss):\n",
" for prs in sorted(prss[sp]):\n",
" print(\"{:<5} {:<3} : {:>5}\".format(sp, prs, prss[sp][prs]))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"pronominal_suffix = {\n",
" \"accusative\": {\n",
" \"W\": (\"p3-sg-m\", \"him\"),\n",
" \"K\": (\"p2-sg-m\", \"you:m\"),\n",
" \"J\": (\"p1-sg-\", \"me\"),\n",
" \"M\": (\"p3-pl-m\", \"them:mm\"),\n",
" \"H\": (\"p3-sg-f\", \"her\"),\n",
" \"HM\": (\"p3-pl-m\", \"them:mm\"),\n",
" \"KM\": (\"p2-pl-m\", \"you:mm\"),\n",
" \"NW\": (\"p1-pl-\", \"us\"),\n",
" \"HW\": (\"p3-sg-m\", \"him\"),\n",
" \"NJ\": (\"p1-sg-\", \"me\"),\n",
" \"K=\": (\"p2-sg-f\", \"you:f\"),\n",
" \"HN\": (\"p3-pl-f\", \"them:ff\"),\n",
" \"MW\": (\"p3-pl-m\", \"them:mm\"),\n",
" \"N\": (\"p3-pl-f\", \"them:ff\"),\n",
" \"KN\": (\"p2-pl-f\", \"you:ff\"),\n",
" },\n",
" \"genitive\": {\n",
" \"W\": (\"p3-sg-m\", \"his\"),\n",
" \"K\": (\"p2-sg-m\", \"your:m\"),\n",
" \"J\": (\"p1-sg-\", \"my\"),\n",
" \"M\": (\"p3-pl-m\", \"their:mm\"),\n",
" \"H\": (\"p3-sg-f\", \"her\"),\n",
" \"HM\": (\"p3-pl-m\", \"their:mm\"),\n",
" \"KM\": (\"p2-pl-m\", \"your:mm\"),\n",
" \"NW\": (\"p1-pl-\", \"our\"),\n",
" \"HW\": (\"p3-sg-m\", \"his\"),\n",
" \"NJ\": (\"p1-sg-\", \"my\"),\n",
" \"K=\": (\"p2-sg-f\", \"your:f\"),\n",
" \"HN\": (\"p3-pl-f\", \"their:ff\"),\n",
" \"MW\": (\"p3-pl-m\", \"their:mm\"),\n",
" \"N\": (\"p3-pl-f\", \"their:ff\"),\n",
" \"KN\": (\"p2-pl-f\", \"your:ff\"),\n",
" },\n",
"}\n",
"switch_prs = dict(\n",
" subs=\"genitive\",\n",
" verb=\"accusative\",\n",
" prep=\"accusative\",\n",
" conj=None,\n",
" nmpr=None,\n",
" art=None,\n",
" adjv=\"genitive\",\n",
" nega=None,\n",
" prps=None,\n",
" advb=None,\n",
" prde=None,\n",
" intj=\"accusative\",\n",
" inrg=\"genitive\",\n",
" prin=None,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"def get_prs_info(w):\n",
" sp = F.sp.v(w)\n",
" prs = F.prs.v(w)\n",
" switch = switch_prs[sp]\n",
" return pronominal_suffix.get(switch, {}).get(prs, (None, None))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Making a verb-clause index\n",
"\n",
"We generate an index which gives for each verb lexeme a list of clauses that have that lexeme as the main verb.\n",
"In the index we store the clause node together with the word node(s) that carry the main verb(s).\n",
"\n",
"Clauses may have multiple verbs. In many cases it is a copula plus another verb.\n",
"In those cases, we are interested in the other verb, so we exclude copulas.\n",
"\n",
"Yet there are also sentences with more than one main verb.\n",
"In those cases, we treat each verb separately as a main verb of one and the same clause."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"..............................................................................................\n",
". 1m 01s Making the verb-clause index .\n",
"..............................................................................................\n"
]
}
],
"source": [
"utils.caption(4, \"Making the verb-clause index\")\n",
"occs = collections.defaultdict(\n",
" list\n",
") # dictionary of all verb occurrence nodes per verb lexeme\n",
"verb_clause = collections.defaultdict(\n",
" list\n",
") # dictionary of all verb occurrence nodes per clause node\n",
"clause_verb = (\n",
" collections.OrderedDict()\n",
") # idem but for the occurrences of selected verbs"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| 1m 03s \tDone (69439 clauses)\n"
]
}
],
"source": [
"for w in F.otype.s(\"word\"):\n",
" if F.sp.v(w) != \"verb\":\n",
" continue\n",
" lex = F.lex.v(w).rstrip(\"[\")\n",
" pf = F.predication.v(L.u(w, \"phrase\")[0])\n",
" if pf in pf_predicate:\n",
" cn = L.u(w, \"clause\")[0]\n",
" clause_verb.setdefault(cn, []).append(w)\n",
" verb_clause[lex].append((cn, w))\n",
"utils.caption(0, \"\\tDone ({} clauses)\".format(len(clause_verb)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# (Indirect) Objects, Locatives, Benefactives"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"..............................................................................................\n",
". 1m 03s Finding key constituents .\n",
"..............................................................................................\n"
]
}
],
"source": [
"utils.caption(4, \"Finding key constituents\")\n",
"constituents = collections.defaultdict(lambda: collections.defaultdict(set))\n",
"ckinds = \"\"\"\n",
" dos pdos ndos kdos ldos idos cdos inds locs cpls bens\n",
"\"\"\".strip().split()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# go through all relevant clauses and collect all kinds of key constituents\n",
"for c in clause_verb:\n",
" these_constituents = collections.defaultdict(set)\n",
" # phrase like constituents\n",
" for p in L.d(c, \"phrase\"):\n",
" gf = F.grammatical.v(p)\n",
" of = F.original.v(p)\n",
" sf = F.semantic.v(p)\n",
" vf = F.valence.v(p)\n",
" ckind = None\n",
" if gf in gf_direct_object:\n",
" if gf == \"principal_direct_object\":\n",
" ckind = \"pdos\"\n",
" elif gf == \"NP_direct_object\":\n",
" ckind = \"ndos\"\n",
" elif gf == \"L_object\":\n",
" ckind = \"ldos\"\n",
" elif gf == \"K_object\":\n",
" ckind = \"kdos\"\n",
" else:\n",
" ckind = \"dos\"\n",
" elif gf in gf_indirect_object:\n",
" ckind = \"inds\"\n",
" elif sf and sf in sf_benefactive:\n",
" ckind = \"bens\"\n",
" elif sf in sf_locative and vf in vf_locative:\n",
" ckind = \"locs\"\n",
" elif gf in gf_complement:\n",
" ckind = \"cpls\"\n",
" if ckind:\n",
" these_constituents[ckind].add(p)\n",
"\n",
" # clause like constituents: only look for object clauses dependent on this clause\n",
" for ac in L.d(L.u(c, \"sentence\")[0], \"clause\"):\n",
" dep = list(E.mother.f(ac))\n",
" if len(dep) and dep[0] == c:\n",
" gf = F.grammatical.v(ac)\n",
" ckind = None\n",
" if gf in gf_direct_object:\n",
" if gf == \"direct_object\":\n",
" ckind = \"cdos\"\n",
" elif gf == \"infinitive_object\":\n",
" ckind = \"idos\"\n",
" if ckind:\n",
" these_constituents[ckind].add(ac)\n",
"\n",
" for ckind in these_constituents:\n",
" constituents[c][ckind] |= these_constituents[ckind]"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| 1m 05s \tDone, 47571 clauses with relevant constituents\n"
]
}
],
"source": [
"utils.caption(\n",
" 0, \"\\tDone, {} clauses with relevant constituents\".format(len(constituents))\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"def makegetGloss():\n",
" if \"lex\" in F.otype.all:\n",
"\n",
" def _getGloss(w):\n",
" gloss = F.gloss.v(L.u(w, \"lex\")[0])\n",
" return \"?\" if gloss is None else gloss\n",
"\n",
" else:\n",
"\n",
" def _getGloss(w):\n",
" gloss = F.gloss.v(w)\n",
" return \"?\" if gloss is None else gloss\n",
"\n",
" return _getGloss"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"getGloss = makegetGloss()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"testcases = (\n",
" # 426955,\n",
" # 427654,\n",
" # 428420,\n",
" # 429412,\n",
" # 429501,\n",
" # 429862,\n",
" # 431695,\n",
" # 431893,\n",
" # 430372,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"def showcase(n):\n",
" otype = F.otype.v(n)\n",
" verseNode = L.u(n, \"verse\")[0]\n",
" place = T.sectionFromNode(verseNode)\n",
" print(\n",
" \"\"\"CASE {}={} ({}-{})\\nCLAUSE: {}\\nVERSE\\n{} {}\\nGLOSS {}\\n\"\"\".format(\n",
" n,\n",
" otype,\n",
" F.rela.v(n),\n",
" F.typ.v(n),\n",
" T.text(L.d(n, \"word\"), fmt=\"text-trans-plain\"),\n",
" \"{} {}:{}\".format(*place),\n",
" T.text(L.d(verseNode, \"word\"), fmt=\"text-trans-plain\"),\n",
" \" \".join(getGloss(w) for w in L.d(verseNode, \"word\")),\n",
" )\n",
" )\n",
" print(\"PHRASES\\n\")\n",
" for p in L.d(n, \"phrase\"):\n",
" print(\n",
" '''{} ({}-{}) {} \"{}\"'''.format(\n",
" p,\n",
" F.function.v(p),\n",
" F.typ.v(p),\n",
" T.text(L.d(p, \"word\"), fmt=\"text-trans-plain\"),\n",
" \" \".join(getGloss(w) for w in L.d(p, \"word\")),\n",
" )\n",
" )\n",
" print(\n",
" \"valence = {}; grammatical = {}; lexical = {}; semantic = {}\\n\".format(\n",
" F.valence.v(p),\n",
" F.grammatical.v(p),\n",
" F.lexical.v(p),\n",
" F.semantic.v(p),\n",
" )\n",
" )\n",
" print(\"SUBCLAUSES\\n\")\n",
" for ac in L.d(L.u(n, \"sentence\")[0], \"clause\"):\n",
" dep = list(E.mother.f(ac))\n",
" if not (len(dep) and dep[0] == n):\n",
" continue\n",
" print(\n",
" '''{} ({}-{}) {} \"{}\"'''.format(\n",
" ac,\n",
" F.rela.v(ac),\n",
" F.typ.v(ac),\n",
" T.text(L.d(ac, \"word\"), fmt=\"text-trans-plain\"),\n",
" \" \".join(getGloss(w) for w in L.d(ac, \"word\")),\n",
" )\n",
" )\n",
" print(\n",
" \"valence = {}; grammatical = {}; lexical = {}; semantic = {}\\n\".format(\n",
" F.valence.v(ac),\n",
" F.grammatical.v(ac),\n",
" F.lexical.v(ac),\n",
" F.semantic.v(ac),\n",
" )\n",
" )\n",
"\n",
" print(\"CONSTITUENTS\")\n",
" for ckind in ckinds:\n",
" print(\n",
" \"{:<4}: {}\".format(\n",
" ckind, \",\".join(str(x) for x in sorted(constituents[n][ckind]))\n",
" )\n",
" )\n",
" print(\"================\\n\")"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"if not SCRIPT:\n",
" for n in testcases:\n",
" showcase(n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Overview of quantities"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"..............................................................................................\n",
". 1m 08s Counting constituents .\n",
"..............................................................................................\n"
]
}
],
"source": [
"utils.caption(4, \"Counting constituents\")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"constituents_count = collections.defaultdict(collections.Counter)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"for c in constituents:\n",
" for ckind in ckinds:\n",
" n = len(constituents[c][ckind])\n",
" constituents_count[ckind][n] += 1"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| 1m 10s \t22375 clauses with 1 dos constituents\n",
"| 1m 10s \t25196 clauses with 0 dos constituents\n",
"| 1m 10s \t22375 clauses with a dos constituent\n",
"| 1m 10s \t 3557 clauses with 1 pdos constituents\n",
"| 1m 10s \t44014 clauses with 0 pdos constituents\n",
"| 1m 10s \t 3557 clauses with a pdos constituent\n",
"| 1m 10s \t 991 clauses with 1 ndos constituents\n",
"| 1m 10s \t46580 clauses with 0 ndos constituents\n",
"| 1m 10s \t 991 clauses with a ndos constituent\n",
"| 1m 10s \t 111 clauses with 1 kdos constituents\n",
"| 1m 10s \t47460 clauses with 0 kdos constituents\n",
"| 1m 10s \t 111 clauses with a kdos constituent\n",
"| 1m 10s \t 33 clauses with 2 ldos constituents\n",
"| 1m 10s \t 3788 clauses with 1 ldos constituents\n",
"| 1m 10s \t43750 clauses with 0 ldos constituents\n",
"| 1m 10s \t 3821 clauses with a ldos constituent\n",
"| 1m 10s \t 1 clauses with 3 idos constituents\n",
"| 1m 10s \t 18 clauses with 2 idos constituents\n",
"| 1m 10s \t 1193 clauses with 1 idos constituents\n",
"| 1m 10s \t46359 clauses with 0 idos constituents\n",
"| 1m 10s \t 1212 clauses with a idos constituent\n",
"| 1m 10s \t 1305 clauses with 1 cdos constituents\n",
"| 1m 10s \t46266 clauses with 0 cdos constituents\n",
"| 1m 10s \t 1305 clauses with a cdos constituent\n",
"| 1m 10s \t 56 clauses with 2 inds constituents\n",
"| 1m 10s \t 5223 clauses with 1 inds constituents\n",
"| 1m 10s \t42292 clauses with 0 inds constituents\n",
"| 1m 10s \t 5279 clauses with a inds constituent\n",
"| 1m 10s \t 1 clauses with 6 locs constituents\n",
"| 1m 10s \t 1 clauses with 4 locs constituents\n",
"| 1m 10s \t 16 clauses with 3 locs constituents\n",
"| 1m 10s \t 330 clauses with 2 locs constituents\n",
"| 1m 10s \t12164 clauses with 1 locs constituents\n",
"| 1m 10s \t35059 clauses with 0 locs constituents\n",
"| 1m 10s \t12512 clauses with a locs constituent\n",
"| 1m 10s \t 3 clauses with 3 cpls constituents\n",
"| 1m 10s \t 87 clauses with 2 cpls constituents\n",
"| 1m 10s \t 8704 clauses with 1 cpls constituents\n",
"| 1m 10s \t38777 clauses with 0 cpls constituents\n",
"| 1m 10s \t 8794 clauses with a cpls constituent\n",
"| 1m 10s \t 2 clauses with 2 bens constituents\n",
"| 1m 10s \t 171 clauses with 1 bens constituents\n",
"| 1m 10s \t47398 clauses with 0 bens constituents\n",
"| 1m 10s \t 173 clauses with a bens constituent\n",
"| 1m 10s \t69439 clauses\n"
]
}
],
"source": [
"for ckind in ckinds:\n",
" total = 0\n",
" for (count, n) in sorted(constituents_count[ckind].items(), key=lambda y: -y[0]):\n",
" if count:\n",
" total += n\n",
" utils.caption(\n",
" 0, \"\\t{:>5} clauses with {:>2} {:<10} constituents\".format(n, count, ckind)\n",
" )\n",
" utils.caption(\n",
" 0, \"\\t{:>5} clauses with {:>2} {:<10} constituent\".format(total, \"a\", ckind)\n",
" )\n",
"utils.caption(0, \"\\t{:>5} clauses\".format(len(clause_verb)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Applying the flowchart\n",
"\n",
"We can now apply the flowchart in a straightforward manner.\n",
"\n",
"We output the results as a comma-separated file that can be imported directly into SHEBANQ as a set of notes, so that readers can check the results within SHEBANQ. This has the benefit that the full context is available, and the data view can easily be called up to inspect the coding situation for each particular instance."
]
},
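{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of this export step (the column names below are illustrative only; the actual layout of `valenceNotes.csv` is dictated by SHEBANQ's bulk-note import format):\n",
"\n",
"```python\n",
"import csv\n",
"\n",
"rows = [\n",
"    # (book, chapter, verse, sense label, status, sense text)\n",
"    (\"Genesis\", 1, 1, \"d-\", \"!\", \"create the heavens and the earth\"),\n",
"]\n",
"with open(\"valenceNotes.csv\", \"w\", newline=\"\") as fh:\n",
"    writer = csv.writer(fh)\n",
"    writer.writerow((\"book\", \"chapter\", \"verse\", \"label\", \"status\", \"text\"))\n",
"    writer.writerows(rows)\n",
"```"
]
},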
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"glossHacks = {\n",
" \"XQ/\": \"law/precept\",\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"def reptext(\n",
" label,\n",
" ckind,\n",
" v,\n",
" phrases,\n",
" num=False,\n",
" txt=False,\n",
" gloss=False,\n",
" textformat=\"text-trans-plain\",\n",
"):\n",
" if phrases is None:\n",
" return \"\"\n",
" phrases_rep = []\n",
" for p in sorted(phrases, key=N.sortKey):\n",
" ptext = \"[{}|\".format(F.number.v(p) if num else \"[\")\n",
" if txt:\n",
" ptext += T.text(L.d(p, \"word\"), fmt=textformat)\n",
" if gloss:\n",
" words = L.d(p, \"word\")\n",
" if ckind == \"ldos\" and F.lex.v(words[0]) == \"L\":\n",
" words = words[1:]\n",
"\n",
" wtexts = []\n",
" for w in words:\n",
" g = glossHacks.get(F.lex.v(w), getGloss(w)).replace(\n",
" \"