{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting Heads 😶\n", "## By Cody Kingham, in collaboration with Christiaan Erwich\n", "\n", "## Problem Description\n", "The ETCBC's BHSA core data does not contain the standard syntax tree format. This also means that syntactic and functional relationships between individual words are not mapped in a transparent or easily accessible way. In some cases, fine-grained relationships are ignored altogether. For example, for a given noun phrase (NP), there is no explicit way of obtaining its head noun (i.e. the noun itself without any modifying elements). This causes numerous problems for research in the realm of semantics. For instance, it is currently very difficult to calculate the complete person, gender, and number (PGN) of a given subject phrase. That is because PGN is stored at the word level only. But this is an inadequate representation. Phrases in the ETCBC often contain coordinate relationships within the phrase. So even if one selects the first \"noun\" in the phrase and checks for its PGN value, they may overlook the presence of another noun which makes the phrase plural. Ideally, the phrase itself would have a PGN feature. But before this kind of data is created, it is necessary to separate the head words of a phrase from their modifying elements such as adjectives, determiners, or nouns in construct (genitive) relations.\n", "\n", "A head word can be defined as the word after which a phrase type is named. Examples of phrase types are \"NP\" (noun phrase) or \"VP\" (verb phrase). In this notebook, we experiment with and build the functions stored in `heads.py` in order to export a set of Text-Fabric edge features. The edge features represent a mapping from a phrase node to its head element. \n", "\n", "This goal requires us to think carefully about the way inter-word, semantic relations are reflected in the ETCBC's data. 
The ETCBC *does* contain some rudimentary semantic embeddings through the so-called [subphrase](https://etcbc.github.io/bhsa/features/hebrew/c/otype). These can be utilized to isolate head words from secondary elements. A subphrase should *not* be thought of as a smaller, embedded phrase, like the ETCBC's phrase-atom (though it sometimes must inadequately fill that role). Rather, the subphrase is a way to encode relationships between words below the level of a phrase(atom), hence \"sub.\" A subphrase can be a single word, or it can be a collection of words. A word can be in multiple subphrases, but cannot be in more than 3 (due to the limitations of the data creation program, [parsephrases](http://www.etcbc.nl/datacreation/#ps3.p)).\n", "\n", "## Method\n", "The types of phrases represented in the ETCBC include `NP` (noun phrase), `VP` (verb phrase), `PrNP` (proper noun phrase), `PP` (prepositional phrase), `AdvP` (adverbial phrase), and [eight others](https://etcbc.github.io/bhsa/features/hebrew/c/typ). For some of these types, isolating the head word is a simple affair. By coordinating a word's phrase-dependent part of speech with its enclosing phrase's type, one can identify the head word. For a `VP`, that would mean simply finding the word within the phrase that has a `pdp` (phrase dependent part of speech) value of `verb`. Or for a prepositional phrase, find the word with a `pdp` of `prep`.\n", "\n", "The `NP` and `PrNP`, on the other hand, present special challenges. These phrases often contain multiple words with a modifying relation to the head noun. An example of this is the construct relation (e.g. \"Son of Jacob\"). 
The problem becomes particularly thorny when relations like the construct are chained together so that one is faced with the choice between multiple potential head nouns.\n", "\n", "To navigate the problem, we must use the feature [rela (relationship)](https://etcbc.github.io/bhsa/features/hebrew/c/rela) stored on `subphrase`s in addition to the `pdp` and phrase `typ` features. In order to isolate the head word of an `NP`, we look for a word within the phrase that has a `pdp` value of `subs` (i.e. noun). We then obtain a list of all the `subphrase`s which contain that word using the [L.u Text-Fabric method](https://github.com/Dans-labs/text-fabric/wiki/Api#locality). We then use the list of subphrase node numbers to create a list of all subphrase relations containing the word. If the list contains *any* dependent relations, then the word is automatically excluded from being a head word and we can move on to the next candidate. One final check is required for candidate words at the level of the `phrase`: the same procedure described above for `subphrase`s must be performed for `phrase_atom` relations. This means excluding words within a `phrase_atom` with a dependent relation to another `phrase_atom` within the `phrase`. If the head of a *`phrase_atom`* is being calculated, this step is not necessary.\n", "\n", "There are only two possible `subphrase` or `phrase_atom` relations for a valid head word: `NA` or `par`/`Para` (the verb is an exception, which in a handful of cases does have a construct relation). `NA` means that no relation is reflected. The word is independent. The `par` (`subphrase`) and `Para` (`phrase_atom`) values stand for parallel relations, i.e. coordinates. While coordinates are not formally the head, they are often an important part of how the grammatical and semantic relations are built. Thus we provide coordinates alongside the head noun. 
These words require one further test, that is, it must be verified that their mother (using the [edge feature](https://github.com/Dans-labs/text-fabric/wiki/Api#edge-features) \"[mother](https://etcbc.github.io/bhsa/features/hebrew/c/mother.html)\") is itself a head word. This step requires us to keep track of those words within the phrase which have been validated. We can do so with a simple list.\n", "\n", "## Results\n", "The function `get_heads` produces head word nodes on supplied phrase(atom) nodes. The results have been manually inspected for consistency.\n", "\n", "For phrase types other than the noun phrase, the results are very accurate. Some phrase types, like the conjunction phrase, do have unexpected forms. For instance, the phrase בעבור is coded as a conjunction phrase in the BHSA; in it, there is actually no word with a part of speech of \"conjunction\". These kinds of cases are easily accounted for by making exceptions in the set of acceptable parts of speech.\n", "\n", "For noun phrases, the situation is different. In the majority of cases, the results are good. But there are a handful of cases that simply cannot be addressed using the current ETCBC data model without a solution that exceeds the bounds of this current project. The reason is that the current model does not transparently encode hierarchy between phrases and embedded phrases. For instance, both phrase atoms and subphrases have *some* overlapping features. But what is the relationship of a phrase atom to a subphrase? Or, what is the relation of one subphrase to another? These are only coded implicitly in the data. In reality, there are \"subphrases\" embedded within the ETCBC's subphrases which are not even registered in the BHSA data. While phrase atoms receive type codes, subphrases do not. Yet, subphrases are \"phrases\" too, which should also have type codes. Another problem is that the precise level of embedding for the subphrases is not provided. 
Subphrases are presented as equal constituents, even though some subphrases are contained within others. These kinds of problems make a simple method, such as applied here, inadequate. But more importantly, they highlight the shortcomings of the ETCBC data model.\n", "\n", "The members of the ETCBC are aware of the inadequacy of that data model to represent complex phrases, and a change is in the pipeline to address it. However, it remains to be seen how long those changes might take. For now, the functions produced and modified in this notebook will suffice to provide a temporary solution for those who require head words from BHSA phrases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Code Development\n", "\n", "Below we experiment with the code and develop the functions that will extract the head nouns. This involves a good deal of manual inspection of the results before exporting the Text-Fabric features.\n", "\n", "The code is written immediately below. Associated questions that arise while writing or evaluating the code are contained in the subsequent section." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**Documentation:** BHSA Feature docs BHSA API Text-Fabric API" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "This notebook online:\n", "NBViewer\n", "GitHub\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import collections, os, sys, random\n", "from pprint import pprint\n", "from tf.fabric import Fabric\n", "from tf.extra.bhsa import Bhsa\n", "\n", "# load Text-Fabric and data\n", "data_loc = ['~/github/etcbc/bhsa/tf/c']\n", "TF = Fabric(locations=data_loc, silent=True)\n", "api = TF.load('''\n", " book chapter verse\n", " typ pdp rela mother \n", " function lex sp ls\n", " ''', silent=True)\n", "\n", "F, E, T, L = api.F, api.E, api.T, api.L # TF data methods\n", "B = Bhsa(api, name='getting_heads', version='c') # BHSA visualizer" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "def get_heads(phrase):\n", " '''\n", " Extracts and returns the heads of a supplied\n", " phrase or phrase atom based on that phrase's type\n", " and the relations reflected within the phrase.\n", " \n", " --input--\n", " phrase(atom) node number\n", " \n", " --output--\n", " tuple of head word node(s) \n", " '''\n", " \n", " # mapping from phrase type to good part of speech values for heads\n", " head_pdps = {'VP': {'verb'}, # verb \n", " 'NP': {'subs', 'adjv', 'nmpr'}, # noun \n", " 'PrNP': {'nmpr', 'subs'}, # proper-noun \n", " 'AdvP': {'advb', 'nmpr', 'subs'}, # adverbial \n", " 'PP': {'prep'}, # prepositional \n", " 'CP': {'conj', 'prep'}, # conjunctive\n", " 'PPrP': {'prps'}, # personal pronoun\n", " 'DPrP': {'prde'}, # demonstrative pronoun\n", " 'IPrP': {'prin'}, # interrogative pronoun\n", " 'InjP': 
{'intj'}, # interjectional\n", " 'NegP': {'nega'}, # negative\n", " 'InrP': {'inrg'}, # interrogative\n", " 'AdjP': {'adjv'} # adjective\n", " } \n", " \n", " # get phrase-head's part of speech value and list of candidate matches\n", " phrase_type = F.typ.v(phrase)\n", " head_candidates = [w for w in L.d(phrase, 'word')\n", " if F.pdp.v(w) in head_pdps[phrase_type]]\n", " \n", " # VP with verbs require no further processing, return the head verb\n", " if phrase_type == 'VP': \n", " return tuple(head_candidates)\n", " \n", " # go head-hunting!\n", " heads = []\n", " \n", " for word in head_candidates:\n", " \n", " # gather the word's subphrase (+ phrase_atom if otype is phrase) relations\n", " word_phrases = list(L.u(word, 'subphrase'))\n", " word_phrases += list(L.u(word, 'phrase_atom')) if (F.otype.v(phrase) == 'phrase') else list()\n", " word_relas = set(F.rela.v(phr) for phr in word_phrases) or {'NA'}\n", "\n", " # check (sub)phrase relations for independency\n", " if word_relas - {'NA', 'par', 'Para'}: \n", " continue\n", " \n", " # check parallel relations for independency\n", " elif word_relas & {'par', 'Para'} and mother_is_head(word_phrases, heads):\n", " this_head = find_quantified(word) or find_attributed(word) or word\n", " heads.append(this_head)\n", " \n", " # save all others as heads, check for quantifiers first\n", " elif word_relas == {'NA'}:\n", " this_head = find_quantified(word) or find_attributed(word) or word\n", " heads.append(this_head)\n", " \n", " return tuple(sorted(set(heads)))\n", " \n", "def mother_is_head(word_phrases, previous_heads):\n", " \n", " '''\n", " Test and validate parallel relationships for independency.\n", " Must gather the mother for each relation and check whether \n", " the mother contains a head word. 
\n", " \n", " --input--\n", " * list of phrase nodes for a given word (includes subphrases)\n", " * list of previously approved heads\n", " \n", " --output--\n", " boolean\n", " '''\n", " \n", " # get word's enclosing phrases that are parallel\n", " parallel_phrases = [ph for ph in word_phrases if F.rela.v(ph) in {'par', 'Para'}]\n", " # get the mother for the parallel phrases\n", " parallel_mothers = [E.mother.f(ph)[0] for ph in parallel_phrases] \n", " # get mothers' words, by mother\n", " parallel_mom_words = [set(L.d(mom, 'word')) for mom in parallel_mothers]\n", " # test for head in each mother\n", " test_mothers = [bool(phrs_words & set(previous_heads)) for phrs_words in parallel_mom_words] \n", " \n", " return all(test_mothers)\n", " \n", "\n", "def find_quantified(word):\n", " \n", " ''' \n", " Check whether a head candidate is a quantifier (e.g. כל).\n", " If it is, find the quantified noun if there is one.\n", " Quantifiers are connected with the modified noun\n", " either by a subphrase relation of \"rec\" for nomen \n", " regens. In this case, the quantifier word node is the\n", " mother itself. In other cases, the noun is related to the\n", " number via the \"atr\" (attributive) subphrase relation. 
In this\n", " case, the edge relation is connected from the substantive\n", " to the number's subphrase.\n", " \n", " --input--\n", " word node\n", " \n", " --output--\n", " new word node or None\n", " '''\n", " \n", " custom_quants = {'KL/', 'M 5:\n", " \n", " heads = get_heads(phrase)\n", " \n", " examples.append(heads)\n", " \n", "len(examples)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "random.shuffle(examples) # get samples at random" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "#B.show(examples[:20]) # uncomment me" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Discovery\n", "\n", "The queries which follow were written at different times during the code construction for the heads algorithm.\n", "\n", "In this section, important questions were asked whose answers are needed to ensure the code is written correctly. The BHSA data is queried to answer them. These are questions like, \"Do we need to check for relational independency for only noun phrases?\" (no); and \"does every phrase type have a word with a corresponding pdp?\" (no).\n", "\n", "### Make definitions available for exploration:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "# mapping from phrase type to its head part of speech\n", "type_to_pdp = {'VP': 'verb', # verb \n", " 'NP': 'subs', # noun \n", " 'PrNP': 'nmpr', # proper-noun \n", " 'AdvP': 'advb', # adverbial \n", " 'PP': 'prep', # prepositional \n", " 'CP': 'conj', # conjunctive\n", " 'PPrP': 'prps', # personal pronoun\n", " 'DPrP': 'prde', # demonstrative pronoun\n", " 'IPrP': 'prin', # interrogative pronoun\n", " 'InjP': 'intj', # interjectional\n", " 'NegP': 'nega', # negative\n", " 'InrP': 'inrg', # interrogative\n", " 'AdjP': 'adjv'} # adjective" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test for non-NP phrases with valid pdp but invalid 
head\n", "\n", "These tests demonstrate that subphrase relation checks are also needed for phrase types besides noun phrases. The only valid subphrase/phrase_atom relations for any potential head word are `NA` or `par`/`Para`. While a few phrase types do not need additional relational checks, e.g. personal pronoun phrases, we can go ahead and consistently handle all phrases in the same way.\n", "\n", "The only exception to the above rule is the `VP`, for which there are 14 cases in which the `VP`'s head word (verb) is also in a subphrase with a regens (`rec`) relation.\n", "\n", "The operational question of these tests was:\n", "> Are there cases in which a non-NP phrase(atom) contains a word with the corresponding pdp value, but which is probably not a head?\n", "\n", "To answer the question, we first survey all cases where the phrase type's head candidate is in a subphrase with a relation that is not normally \"independent.\" Based on the survey, we manually check the most pertinent phrase types and results. The tests reveal that, indeed, relation checks are needed for many phrase types." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "def test_pdp_safe(phrase_object='phrase_atom'):\n", " \n", " '''\n", " Make a survey of phrase types and their matching pdp words,\n", " count what kinds of subphrase relations these words \n", " occur in. 
The survey can then be used to investigate \n", " whether phrase types besides noun phrases require relationship\n", " checks for independency.\n", " '''\n", " \n", " pdp_relas_survey = collections.defaultdict(lambda: collections.Counter())\n", " headless = 0\n", " \n", " for phrase in F.otype.s(phrase_object):\n", "\n", " typ = F.typ.v(phrase) # phrase type\n", "\n", " # skip noun phrases\n", " if typ in {'NP', 'PrNP'}: \n", " continue\n", "\n", " head_pdp = type_to_pdp[typ]\n", "\n", " maybe_heads = [w for w in L.d(phrase, 'word') \n", " if F.pdp.v(w) == head_pdp]\n", " \n", " # this check shows that many\n", " # phrases don't have a word \n", " # with a corresponding pdp!\n", " if not maybe_heads:\n", " headless += 1\n", "\n", " # survey the candidate heads' relations\n", " for word in maybe_heads:\n", "\n", " head_name = typ + '|' + head_pdp\n", " subphrases = L.u(word, 'subphrase')\n", " sp_relas = set(F.rela.v(sp) for sp in subphrases)\\\n", " if subphrases else {'NA'} # <- handle cases without any subphrases (i.e. 
verbs)\n", "\n", " pdp_relas_survey[head_name].update(sp_relas)\n", "\n", " print(f'phrases without matching pdp: {headless}\\n')\n", " print('subphrase relation survey: ')\n", " for name, rela_counts in pdp_relas_survey.items():\n", "\n", " print(name)\n", "\n", " for r, count in rela_counts.items():\n", " print('\\t', r, '-', count)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "phrases without matching pdp: 837\n", "\n", "subphrase relation survey: \n", "PP|prep\n", "\t NA - 64521\n", "\t par - 3824\n", "\t adj - 42\n", "\t rec - 8\n", "VP|verb\n", "\t NA - 69011\n", "\t rec - 14\n", "\t par - 1\n", "CP|conj\n", "\t NA - 53859\n", "AdvP|advb\n", "\t NA - 5131\n", "\t par - 102\n", "\t mod - 49\n", "\t adj - 1\n", "AdjP|adjv\n", "\t NA - 1845\n", "\t par - 135\n", "\t atr - 5\n", "\t adj - 3\n", "\t rec - 1\n", "InjP|intj\n", "\t NA - 1872\n", "\t par - 11\n", "DPrP|prde\n", "\t NA - 790\n", "NegP|nega\n", "\t NA - 6743\n", "PPrP|prps\n", "\t NA - 4468\n", "\t par - 9\n", "IPrP|prin\n", "\t NA - 797\n", "\t par - 1\n", "InrP|inrg\n", "\t NA - 1288\n", "\t par - 3\n" ] } ], "source": [ "# for phrase_atoms\n", "test_pdp_safe()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "phrases without matching pdp: 670\n", "\n", "subphrase relation survey: \n", "PP|prep\n", "\t NA - 62315\n", "\t par - 3678\n", "\t adj - 42\n", "\t rec - 9\n", "VP|verb\n", "\t NA - 69011\n", "\t rec - 14\n", "\t par - 1\n", "CP|conj\n", "\t NA - 52545\n", "AdvP|advb\n", "\t NA - 5083\n", "\t par - 101\n", "\t mod - 46\n", "\t adj - 1\n", "AdjP|adjv\n", "\t NA - 1797\n", "\t par - 118\n", "\t atr - 5\n", "\t adj - 3\n", "\t rec - 1\n", "InjP|intj\n", "\t NA - 1872\n", "\t par - 11\n", "DPrP|prde\n", "\t NA - 791\n", "NegP|nega\n", "\t NA - 6743\n", "PPrP|prps\n", "\t NA - 4388\n", "\t par - 9\n", "IPrP|prin\n", "\t NA 
- 797\n", "\t par - 1\n", "InrP|inrg\n", "\t NA - 1288\n", "\t par - 3\n" ] } ], "source": [ "# and for phrases\n", "test_pdp_safe(phrase_object='phrase')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "^ These surveys tell us that for several of these phrase types, e.g. `InjP`, we can automatically take the word with the pdp value that corresponds with its phrase type as the head.\n", "\n", "There are also quite a few cases where the phrase type does not have a word with a matching pdp value: 837 for phrase atoms and 670 for phrases. In the subsequent section we will run tests to find out why this is the case.\n", "\n", "Back to the question of this section: There are 14 examples of VP with verbs that have a `rec` (nomen regens) relation. Are these heads or not? We check now..." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "def find_and_show(search_pattern):\n", " results = sorted(B.search(search_pattern))\n", " print(len(results), 'results')\n", " B.show(results)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "# run notebook locally to see HTML-formatted results for the below searches\n", "\n", "\n", "rec_verbs = '''\n", "\n", "phrase_atom typ=VP\n", " subphrase rela=rec\n", " word pdp=verb\n", "'''\n", "\n", "#find_and_show(rec_verbs) # uncomment me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In all 14 results, the verb serves as the true head word of the `VP`.\n", "\n", "*Note: The verb will prove to be an exception, as all other words in a `rec` relation are not head words*\n", "\n", "The `PP` also has some strange relations. We see what's going on with the same kind of inspection. First we look at the `rec` (regens) relations." 
] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "rec_preps = '''\n", "\n", "phrase_atom typ=PP\n", " subphrase rela=rec\n", " word pdp=prep\n", "'''\n", "\n", "#find_and_show(rec_preps) #uncomment me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The PP is different. In cases where the subphrase relation is `rec`, the preposition is *not* the head. Thus, the algorithm will need to check for these cases.\n", "\n", "Now for the `adj` subphrase relation in `PP`:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "adj_preps = '''\n", "\n", "phrase_atom typ=PP\n", " subphrase rela=adj\n", " word pdp=prep\n", "'''\n", "\n", "#find_and_show(adj_preps) # uncomment me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results above show that a preposition in an `adj` subphrase relation is also a non-head. These cases have to be excluded.\n", "\n", "Now we move on to test the **adverb** relations reflected in the survey..." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "adv_adj = '''\n", "\n", "phrase_atom typ=AdvP\n", " subphrase rela=adj\n", " word pdp=advb\n", "\n", "'''\n", "\n", "#find_and_show(adv_adj) # uncomment me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An adverb in an `adj` relation within the adverbial phrase is also not a true head. Now for the `mod` (modifier) relation." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "adv_mod = '''\n", "\n", "phrase_atom typ=AdvP\n", " subphrase rela=mod\n", " word pdp=advb\n", "\n", "'''\n", "\n", "#find_and_show(adv_mod) # uncomment me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, it appears that `mod` is also an invalid relation for adverb phrases. An example is גם הלם ('also here') where גם is the adverb in `mod` relation, but the head is really הלם \"here\" (also an adverb). 
In several cases, the modifier modifies a verb. In these cases the \"head,\" often a participle or infinitive, acts as the adverb, even though it is not explicitly marked as such.\n", "\n", "Now we move on to the last examination, that of the `AdjP` (adjective phrase). There are three relations of interest:\n", "> atr - 5
\n", "> adj - 3
\n", "> rec - 1
" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "adj_atr = '''\n", "\n", "phrase_atom typ=AdjP\n", " subphrase rela=atr\n", " word pdp=adjv\n", "\n", "'''\n", "\n", "#find_and_show(adj_atr) # uncomment me!" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "adj_adj = '''\n", "\n", "phrase_atom typ=AdjP\n", " subphrase rela=adj\n", " word pdp=adjv\n", "\n", "'''\n", "\n", "#find_and_show(adj_adj) # uncomment me!" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "adj_rec = '''\n", "\n", "phrase_atom typ=AdjP\n", " subphrase rela=rec\n", " word pdp=adjv\n", "\n", "'''\n", "\n", "#find_and_show(adj_rec) # uncomment me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results for the three searches above show indeed that the relations of `atr`, `adj`, and `rec` are not head words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tests for phrase types without a word that has a valid pdp value\n", "\n", "The initial survey above revealed that 837 phrase atoms and 670 phrases lack a word with a corresponding pdp value. Here we investigate to see why that is the case. Is there a way to compensate for this problem? Are these truly phrases that lack heads?\n", "\n", "We run another survey and count the phrase types against the non-matching pdp values found within them. At this point, we must also exclude words that have dependent relations (as defined above, subphrase values of NA or parallel)." 
] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AdvP\n", "\t nmpr - 253\n", "\t subs - 499\n", "\t art - 190\n", "\t conj - 13\n", "PrNP\n", "\t subs - 9\n", "\t art - 3\n", "CP\n", "\t prep - 85\n", "\t subs - 79\n", "\t advb - 6\n", "NP\n", "\t intj - 1\n" ] } ], "source": [ "count_no_pdp = collections.defaultdict(lambda: collections.Counter())\n", "record_no_pdp = collections.defaultdict(lambda: collections.defaultdict(list))\n", "\n", "for phrase in F.otype.s('phrase_atom'):\n", " \n", " typ = F.typ.v(phrase)\n", " \n", " # see if there is not corresponding pdp value\n", " corres_pdp = type_to_pdp[typ]\n", " corresponding_pdps = [w for w in L.d(phrase, 'word') \n", " if F.pdp.v(w) == corres_pdp]\n", " \n", " if not corresponding_pdps:\n", " \n", " # put potential heads here\n", " maybe_heads = []\n", " \n", " # calculate subphrase relations\n", " for word in L.d(phrase, 'word'):\n", " \n", " # get subphrase relations\n", " word_subphrs = L.u(word, 'subphrase')\n", " sp_relas = set(F.rela.v(sp) for sp in word_subphrs) or {'NA'}\n", " \n", " # check subphrase relations for independence\n", " if sp_relas == {'NA'}:\n", " maybe_heads.append(word)\n", " \n", " # test parallel relation for independence\n", " elif sp_relas == {'NA', 'par'} or sp_relas == {'par'}:\n", " \n", " # check for good, head mothers\n", " good_mothers = set(sp for w in maybe_heads for sp in L.u(w, 'subphrase'))\n", " this_daughter = [sp for sp in word_subphrs if F.rela.v(sp) == 'par'][0]\n", " this_mother = E.mother.f(this_daughter)\n", " \n", " if this_mother in good_mothers:\n", " maybe_heads.append(word)\n", " \n", " # sanity check\n", " # maybe_heads should have SOMETHING\n", " if not maybe_heads:\n", " raise Exception(f'phrase {phrase} looks HEADLESS!')\n", " \n", " # count pdp types\n", " head_pdps = [F.pdp.v(w) for w in maybe_heads]\n", " count_no_pdp[typ].update(head_pdps)\n", " \n", " # save for 
examination\n", " for word in maybe_heads:\n", " record_no_pdp[typ][F.pdp.v(word)].append((phrase, word))\n", " \n", "for name, counts in count_no_pdp.items():\n", "\n", " print(name)\n", "\n", " for pdp, count in counts.items():\n", " print('\\t', pdp, '-', count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These results are a bit puzzling. The numbers here are words within the phrase atoms that have NO subphrase relations. That means, for example, words such as הַ \"the\" do not appear to have any subphrase relation to their modified nouns. That again illustrates the shortcoming of the ETCBC data in this respect. There should be a relation from the article to the determined noun.\n", "\n", "From this point forward, I will begin working through all four phrase types and the cases reflected in the survey.\n", "\n", "We begin with the `AdvP` type and the article. Upon some initial inspection, I've found that in many of the `AdvP` with the article, there is also a substantive (`subs`) that was found by the search. Are there any cases where there is no `nmpr` or `subs` found alongside the article? We can use the dict `record_no_pdp` which has recorded all cases reflected in the survey. Below I look to see if all 190 cases of an article in these `AdvP` phrases also have a corresponding noun." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 without nouns found...\n" ] } ], "source": [ "no_noun = []\n", "\n", "for phrase in record_no_pdp['AdvP']['art']:\n", " \n", " pdps = set(F.pdp.v(w) for w in L.d(phrase[0], 'word'))\n", " \n", " if not {'nmpr', 'subs'} & pdps:\n", " no_noun.append((phrase,))\n", " \n", "print(len(no_noun), 'without nouns found...')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There it is. So all cases of these articles can be discarded. In these cases, the noun serves as the head of the adverbial phrase. 
An example of this is when the noun marks the location of the action (hence adverb). \n", "\n", "Next, we check the conjunctions found in the adverbial phrases. Are any of those heads?" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "#B.show(record_no_pdp['AdvP']['conj']) # uncomment me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All conjunctions in these `AdvP` phrases function to mark coordinate elements (only ו in these results). They can also be discarded as possible heads.\n", "\n", "Now we investigate the `PrNP` results with `subs` and `art`..." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "#B.show(record_no_pdp['PrNP']['subs']) # uncomment me!" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "#B.show(record_no_pdp['PrNP']['art']) # uncomment me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `art` words reflected in the second search are not heads, but are all related to a substantive. All of the results in `subs` are heads. Thus, the only acceptable pdp for `PrNP` besides a proper noun is `subs`.\n", "\n", "Now we dig into `CP` results. 85 of them have no `pdp` of conjunction, but have a preposition instead. Let's see what's going on..." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "#B.show(record_no_pdp['CP']['prep'][:20]) # uncomment me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are very interesting results. These conjunction phrases are made up of constructions like ב+עבור and ב+טרם. Together these words function as a conjunction, but alone they are prepositions and particles. Is it even possible in this case to say that there is a \"head\"?\n", "\n", "It could be said that these combinations of words mean more than the sum of their parts; they are good examples of constructions, i.e. 
combinations of words whose meaning cannot be inferred simply from their individual words. Constructions illustrate the vague boundary between syntax and lexicon (cf. e.g. Goldberg, 1995, *Constructions*).\n", "\n", "While these words are indeed marked as conjunction phrases, it is better in this case to analyze them as prepositional phrases (which they also are...this is another shortcoming of our data, or perhaps a mistake??). Thus, the head is the preposition, not the prepositional object.\n", "\n", "We should expect that the remaining `subs` and `advb` groups are in fact the objects of those prepositions (and hence excluded). Let's test that assumption by looking for a preposition immediately before these words..." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "subs|advb with no preceding prepositions: 0\n" ] } ], "source": [ "no_prep = []\n", "\n", "for (phrase, word) in record_no_pdp['CP']['subs'] + record_no_pdp['CP']['advb']:\n", " \n", " possible_prep = word - 1\n", " \n", " if F.sp.v(possible_prep) != 'prep':\n", " no_prep.append((phrase, word))\n", " \n", "print(f'subs|advb with no preceding prepositions: {len(no_prep)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This confirms that none of the substantives or adverbs will be the head of a conjunction phrase. A preposition is the only other kind of head for the `CP` besides a conjunction itself.\n", "\n", "Finally, we're left with a last noun phrase (`NP`) for which no matching noun was found. The search found instead both `adjv` (adjective) and an `intj` (interjection). Let's see it." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "#B.show(record_no_pdp['NP']['intj']) # uncomment me" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, the word אוי \"woe\" functions like a noun. 
This appears to be another mislabeled `pdp` value, since it should read `subs`. This phrase, like the previous example, will not receive a head value due to the mistake." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieving Quantified Words\n", "\n", "When the heads algorithm looks for a noun without any subphrase relations in the phrase, it will often return a quantifier noun such as a number (e.g. שבעה \"seven\") or another descriptor like כל. But these words function semantically in a more descriptive role than a head role. Thus, we want our algorithm to isolate quantified nouns from their quantifiers. To do that, we must first know how the ETCBC encodes the relationship between a quantifier and the quantified noun. \n", "\n", "In a previous algorithm used for extracting quantified nouns, we looked for a nomen regens relation on the quantifier and located the noun within the related subphrase. This approach works well for the quantifier כל. But for cardinal numbers, the relation `adj` (adjunct) is often used as well (as seen in the surveys below).\n", "\n", "To illustrate with the search below, the quantifier שבעה \"seven\" has no nomen regens relation:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "#B.show([(2217,)]) # uncomment me!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than reflecting a regens/rectum relation, the second word שנים \"years,\" the quantified noun, has a subphrase relation of `adj` \"adjunct\":" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "שֶׁ֣בַע שָׁנִ֔ים וּשְׁמֹנֶ֥ה מֵאֹ֖ות שָׁנָ֑ה \n", "\n", "1301096 (subphrase)\n", "\t שָׁנִ֔ים \n", "\t rela: adj\n", "\n", "1301097 (subphrase)\n", "\t שֶׁ֣בַע שָׁנִ֔ים \n", "\t rela: NA\n", "\n" ] } ], "source": [ "print(T.text(L.d(652883, 'word')))\n", "print()\n", "\n", "for sp in L.u(2218, 'subphrase'): # subphrases belonging to \"years\"\n", "    print(sp, '(subphrase)')\n", "    print('\\t', T.text(L.d(sp,'word')))\n", "    print('\\t', 'rela:', F.rela.v(sp))\n", "    print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what other kinds of subphrase relations are reflected by quantifieds.\n", "\n", "Below we make a survey of all mother-daughter relations between a quantifier subphrase and its daughters. The goal is to isolate those relationships which contain the quantified noun. We work through examples to get an idea of the meaning of the features. And we write a few TF search queries further below to confirm hypotheses about these relationships." 
] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">XD/\n", "\t par - 62\n", "\t adj - 42\n", "\t rec - 69\n", "\t Appo - 1\n", "\t Spec - 6\n", "\t atr - 8\n", "\t mod - 4\n", "CNJM/\n", "\t rec - 345\n", "\t adj - 267\n", "\t par - 109\n", "\t mod - 14\n", "\t atr - 5\n", "\t Spec - 6\n", "\t dem - 3\n", "\t Sfxs - 1\n", ">RBH/\n", "\t rec - 30\n", "\t adj - 302\n", "\t par - 217\n", "\t atr - 2\n", "CMNH/\n", "\t adj - 128\n", "\t par - 28\n", "\t rec - 8\n", "\t mod - 1\n", "TCLP=/\n", "\t adj - 143\n", "\t rec - 33\n", "\t par - 198\n", "\t Spec - 1\n", "\t atr - 2\n", "/\n", "\t adj - 3\n", "\t par - 3\n", "XD/\n", "\t atr - 1\n", "\t Spec - 2\n", "CTJN/\n", "\t par - 1\n", "CT/\n", "TLT/\n", "\t adj - 2\n", "TRJN/\n", "\t rec - 2\n", ">LP/\n", "\t rec - 1\n", "XD/']['atr']) # uncomment me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Spec` relations (a phrase_atom `rela`) are cases where a phrase atom is used to add adjectival information about the quantifier." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "#B.show(quant_ex['>XD/']['Spec']) # uncomment me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `mod` relations are cases where the quantifier is modified with particles like גם or רק." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "#B.show(quant_ex['CNJM/']['mod']) # uncomment me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `dem` relation occurs when a demonstrative like אלה modifies the quantifier." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "#B.show(quant_ex['CB Subs Missed Results\n", "\n", "In the first test, several substantives are missed due to the presence of an adjectival element. Let's look at those cases and see what's going on. 
I have copied the phrase numbers of a few relevant examples." ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "adj_examples = [(771933,), (799523,)]\n", "\n", "#B.show(adj_examples) # uncomment me" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "subphrase \ttext \trelation \tmother\n", "בֶן־ 1355711 NA ()\n", "אָמֹ֔וץ 1355712 rec (207817,)\n" ] } ], "source": [ "show_subphrases(adj_examples[0][0])" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "יְשַֽׁעְיָ֣הוּ nmpr\n", "בֶן־ subs\n", "אָמֹ֔וץ nmpr\n" ] } ], "source": [ "for word in L.d(adj_examples[0][0], 'word'):\n", "    print(T.text([word]), F.pdp.v(word))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, the substantive is not detected by the algorithm since it is in a dependent subphrase, a construct relation, with its modifying adjective. How can these nouns be extracted?\n", "\n", "This is very similar to the quantifier case, where the word in the rectum is actually the head (e.g. שתי שנה \"two years\" where \"two\" is registered as the head, but the substantive \"years\" is the semantic head). This kind of relationship is differentiated from non-heads by the fact that the adjective itself is independent. Thus, in cases where the adjective is independent and has a daughter rectum subphrase, the algorithm should retrieve the attributed noun.\n", "\n", "**Proposed solution**: Add `adjv` to the set of acceptable `pdp` for the `NP`. Any adjectives will be processed for dependency: most will fail that test. 
But for the dozens of cases where the adjective does not fail, the algorithm will apply a separate check for a `rec` related subphrase which contains the true head.\n", "\n", "### Participle -> Head Missed Results\n", "\n", "Other phrases that end up headless are noun phrases that have a participle which serves as the nominal element but, since it has satellites, is coded as a \"verb\":" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "verb_examples = [(709010,), (711593,), (756104,)]\n", "\n", "#B.show(verb_examples) # uncomment me" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "subphrase \ttext \trelation \tmother\n", "\n", "subphrase \ttext \trelation \tmother\n", "\n", "subphrase \ttext \trelation \tmother\n", "\n" ] } ], "source": [ "for phrase in verb_examples:\n", "    show_subphrases(phrase[0])\n", "    print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are mixed cases here due to the shortcomings of the current data model. In these cases, the participle is marked as a \"verb\" since it also has objects or descriptors. In the first example above, the noun גרה functions as the *object* of the verb. The head is מעלה. But the same logic does not hold for the second or third case. In the second case, פצוע־דכא gives an *attribute* or quality of שפכה. 
In the third case, מצק \"poured\" describes an attribute of נחשת \"bronze.\" Thus the opposite of example 1 is true, that is, the head noun is the attributed noun in the construct relation.\n", "\n", "Since the specific role of the noun or the verb is not specified at this lower phrase level, is there even a way to differentiate these cases?\n", "\n" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "709010\n", "וְ conj\n", "\n", "711593\n", "אֵ֥ין nega\n", "\n", "756104\n", "בֵיתֹו֩ subs\n", "\n" ] } ], "source": [ "for phrase in verb_examples:\n", "    print(phrase[0])\n", "    for word in L.d(phrase[0], 'word'):\n", "        print(T.text([word]), F.pdp.v(word))\n", "    print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It actually appears that the database treats all 3 the same: as adjectives at the phrase-dependent part of speech level. Thus, these cases will receive the same treatment as the adjective cases above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### KL/ relation problems\n", "\n", "I found an instance in Numbers 3:15 where the subphrase relationship that connects כל with its quantified noun is \"atr.\" That is probably wrong. Are there other cases with the same problem? 
" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kl_prob = '''\n", "\n", "sp1:subphrase\n", "    w1:word lex=KL/ st=a\n", "\n", "sp2:subphrase rela=atr\n", "\n", "sp2 -mother> sp1\n", "sp2 >> sp1\n", "w1 :: sp1\n", "'''\n", "\n", "kl_prob = sorted(B.search(kl_prob))\n", "\n", "len(kl_prob)" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "#B.show(kl_prob) # uncomment me" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems that the adjectives are not nominalized in this construction as pdp of `subs`. Most of the findings are adjectives in construct with כל. But there are several cases of the participle also.\n", "\n", "Is this encoding correct?\n", "\n", "If the `rela` code were properly `rec` as most are, then this would simply be a matter of adding an additional acceptable `pdp` to the list within the get_quantified function." ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#kl_prob = [r for r in kl_prob if not {'adjv'} & set(F.pdp.v(w) for w in L.d(r[2], 'word'))]\n", "\n", "len(kl_prob)" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kl_prob = set(r[0] for r in kl_prob)\n", "\n", "len(kl_prob)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subphrase by Subphrase Approach?\n", "\n", "This section explores switching from a word-by-word approach to a subphrase-by-subphrase one. The first iteration of the get_heads function iterated word by word to identify valid heads with independent subphrase relations. 
A more efficient and methodologically sound approach would be to work from the subphrase down to the word. Here I experiment with such a method." ] }, { "cell_type": "code", "execution_count": 193, "metadata": {}, "outputs": [], "source": [ "test_phrases = [ph for ph in F.typ.s('NP') \n", "                if len(L.d(ph, 'word')) == 5 \n", "                and F.otype.v(ph) == 'phrase']" ] }, { "cell_type": "code", "execution_count": 194, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "655731" ] }, "execution_count": 194, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = test_phrases[20]\n", "\n", "test" ] }, { "cell_type": "code", "execution_count": 195, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "subphrase \ttext \trelation \tmother\n", "יְלִ֥יד בֵּֽיתְךָ֖ 1302879 NA ()\n", "יְלִ֥יד 1302877 NA ()\n", "בֵּֽיתְךָ֖ 1302878 rec (7530,)\n", "מִקְנַ֣ת כַּסְפֶּ֑ךָ 1302882 par (1302879,)\n", "מִקְנַ֣ת 1302880 NA ()\n", "כַּסְפֶּ֑ךָ 1302881 rec (7533,)\n" ] } ], "source": [ "show_subphrases(test)" ] }, { "cell_type": "code", "execution_count": 196, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1302879, 1302877, 1302880]" ] }, "execution_count": 196, "metadata": {}, "output_type": "execute_result" } ], "source": [ "head_cands = [sp for sp in L.d(test, 'subphrase') if F.rela.v(sp) == 'NA']\n", "\n", "head_cands" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note above that the heads are those within NA relations that consist of single words. How consistent is this? Are there any cases where the head does not receive its own individual subphrase with an NA relation? Or are there cases of NA relations of non-head elements? Below we run a couple of tests, and then we build a primitive head finder based on this hypothesis in order to manually inspect what happens." 
] }, { "cell_type": "code", "execution_count": 201, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "example found: \n", "23 (1300568,) {'rec'}\n", "search complete with 0 results\n" ] } ], "source": [ "for word in F.otype.s('word'):\n", "\n", "    subphrases = L.u(word, 'subphrase')\n", "\n", "    if not subphrases:\n", "        continue\n", "\n", "    sp_relas = set(F.rela.v(sp) for sp in subphrases)\n", "\n", "    if not {'NA', 'par'} & sp_relas:\n", "        print('example found: ')\n", "        print(word, subphrases, sp_relas)\n", "        break\n", "\n", "print('search complete with 0 results')" ] }, { "cell_type": "code", "execution_count": 202, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'תְהֹ֑ום '" ] }, "execution_count": 202, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T.text(L.d(1300568, 'word'))" ] }, { "cell_type": "code", "execution_count": 224, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "45161" ] }, "execution_count": 224, "metadata": {}, "output_type": "execute_result" } ], "source": [ "no_na = '''\n", "\n", "sp1:subphrase\n", "    w1:word\n", "sp2:subphrase\n", "\n", "sp2 -mother> w1\n", "'''\n", "\n", "no_na = sorted(S.search(no_na))\n", "\n", "len(no_na)" ] }, { "cell_type": "code", "execution_count": 229, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "words with construct relation and no NA subphrase: 0\n" ] } ], "source": [ "no_na_filtered = []\n", "\n", "for r in no_na:\n", "\n", "    reg = r[1]\n", "\n", "    reg_subphrases = L.u(reg, 'subphrase')\n", "    reg_sp_relas = set(F.rela.v(sp) for sp in reg_subphrases)\n", "\n", "    if 'NA' not in reg_sp_relas:\n", "        no_na_filtered.append(r)\n", "\n", "print(f'words with construct relation and no NA subphrase: {len(no_na_filtered)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The search above shows that whenever a word is in a construct relation with a subphrase, an NA (no relation) subphrase exists. \n", "\n", "Let's broaden the inquiry a bit. What are the specific situations in which there is NO non-related subphrase at all? What kinds of relations are present? What kinds of phrases are they?" ] }, { "cell_type": "code", "execution_count": 236, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counter({'NO subphrases': 215258, 'has NA': 37952})\n" ] } ], "source": [ "na_survey = collections.Counter()\n", "\n", "for phrase in F.otype.s('phrase'):\n", "\n", "    subphrase_relas = tuple(sorted(set(F.rela.v(sp) for sp in L.d(phrase, 'subphrase'))))\n", "\n", "    if not subphrase_relas:\n", "        na_survey['NO subphrases'] += 1\n", "\n", "    elif 'NA' in subphrase_relas:\n", "        na_survey['has NA'] += 1\n", "\n", "    else:\n", "        na_survey[subphrase_relas] += 1\n", "\n", "pprint(na_survey)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This count shows that there are only two situations in the data: either \n", "\n", "1) a phrase has no subphrases present, or \n", "\n", "2) it has a subphrase with a relation of \"NA\". There are NO cases of phrases that lack an NA subphrase but have other relations. That is good for our hypothesis...\n", "\n", "In the experiment below, two important assumptions are made about the head:\n", "\n", "**First**, it is assumed that **the head is the first valid `pdp` word in the phrase**, with the exception of quantifieds and attributed nouns which are handled differently. \n", "\n", "**Second**, it is assumed that the **first NA-relation subphrase contains the head**. We test that assumption by manually inspecting the output." 
] }, { "cell_type": "code", "execution_count": 292, "metadata": {}, "outputs": [], "source": [ "def primitive_head_hunter(phrase):\n", "\n", "    '''\n", "    Looks at noun phrases for heads.\n", "    '''\n", "\n", "    good_pdp = {'subs', 'nmpr'}\n", "\n", "    subphrase_candidates = [sp for sp in L.d(phrase, 'subphrase') \n", "                            if F.rela.v(sp) == 'NA'\n", "                            and F.rela.v(L.u(sp, 'phrase_atom')[0]) == 'NA'\n", "                           ]\n", "\n", "    # handle simple phrases\n", "    if not subphrase_candidates:\n", "        head_candidates = [w for w in L.d(phrase, 'word') if F.pdp.v(w) in good_pdp]\n", "        try:\n", "            return (head_candidates[0],)\n", "        except IndexError:\n", "            print(f'exception at {phrase}')\n", "            return None # no head found; do not fall through to the subphrase logic\n", "\n", "    # attempt simple head assignment\n", "    first_na_subphrase = subphrase_candidates[0]\n", "    try:\n", "        the_head = next(w for w in L.d(first_na_subphrase, 'word') if F.pdp.v(w) in good_pdp)\n", "        return (the_head,)\n", "    except StopIteration:\n", "        if F.pdp.v(L.d(first_na_subphrase, 'word')[0]) == 'adjv':\n", "            pass\n", "        else:\n", "            raise Exception(phrase)" ] }, { "cell_type": "code", "execution_count": 296, "metadata": {}, "outputs": [], "source": [ "test_results = [primitive_head_hunter(ph) for ph in test_phrases]\n", "\n", "random.shuffle(test_results)" ] }, { "cell_type": "code", "execution_count": 298, "metadata": {}, "outputs": [], "source": [ "#B.show(test_results) # uncomment me" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As it turns out, the assumption about NA phrase type is workable. But the complications of this approach (explained below) make it an unlikely solution for now." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Conclusion\n", "\n", "I've done some initial testing with the subphrase by subphrase approach. It is a promising method, but requires a more complicated implementation with nested searches through each level of the phrase hierarchy. 
A simple subphrase by subphrase approach is not sufficient: one needs to go phrase by phrase, phrase_atom by phrase_atom, subphrase by subphrase, and even beyond. It is a recursive problem that cannot be navigated with the present, limited data model. There is more to say about the present state of the data model which I will save for the final report.\n", "\n", "At present, the word-by-word approach provides an elegant (though limited) solution that is able to navigate the quirks of the present data model and provide an acceptable level of accuracy, with some exceptions for more complicated phrase constructions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling Parallels\n", "\n", "What is the best way to handle parallel head elements? In general, a phrase has only one real \"head\". That is, often the first head element determines the grammatical gender or number of the verb (thanks to Constantijn Sikkel for this conversation). Yet, the nouns which are coordinate to the head are often of interest for both grammatical and semantic studies.\n", "\n", "There are two approaches to collecting coordinate heads. One is to check for every word with a relation of \"parallel\" whether its mother is already established as a head. Another approach is to recursively search for nouns that are coordinate with the head word. Up until this inquiry, I have opted for option 1 due to the complexity of checking necessary relationships for a head candidate. But a phrase in Deuteronomy 12:17 is missed by the current approach, since it contains a chain of head nouns in construct with the quantifier מעשר \"tenth\". \n", "\n", "It is possible to edit the algorithm to accommodate these cases. But the example raises the broader question of whether option 1 is truly sufficient and methodologically sound. In this section, I test whether option 2 is a better alternative. 
First, we are unsure about how to separate a head word from a larger, paralleled subphrase. In option 1, individual words are tested, each for dependent relationships. But option 2 will go the opposite direction: beginning at the subphrase level and working down to the word. Does this affect our ability to separate the head noun of the phrase?" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [], "source": [ "def OLD_get_heads(phrase):\n", "    '''\n", "    Extracts and returns the heads of a supplied\n", "    phrase or phrase atom based on that phrase's type\n", "    and the relations reflected within the phrase.\n", "\n", "    --input--\n", "    phrase(atom) node number\n", "\n", "    --output--\n", "    tuple of head word node(s) \n", "    '''\n", "\n", "    # mapping from phrase type to good part of speech values for heads\n", "    head_pdps = {'VP': {'verb'}, # verb \n", "                 'NP': {'subs', 'adjv', 'nmpr'}, # noun \n", "                 'PrNP': {'nmpr', 'subs'}, # proper-noun \n", "                 'AdvP': {'advb', 'nmpr', 'subs'}, # adverbial \n", "                 'PP': {'prep'}, # prepositional \n", "                 'CP': {'conj', 'prep'}, # conjunctive\n", "                 'PPrP': {'prps'}, # personal pronoun\n", "                 'DPrP': {'prde'}, # demonstrative pronoun\n", "                 'IPrP': {'prin'}, # interrogative pronoun\n", "                 'InjP': {'intj'}, # interjectional\n", "                 'NegP': {'nega'}, # negative\n", "                 'InrP': {'inrg'}, # interrogative\n", "                 'AdjP': {'adjv'} # adjective\n", "                } \n", "\n", "    # get phrase-head's part of speech value and list of candidate matches\n", "    phrase_type = F.typ.v(phrase)\n", "    head_candidates = [w for w in L.d(phrase, 'word')\n", "                       if F.pdp.v(w) in head_pdps[phrase_type]]\n", "\n", "    # VP with verbs require no further processing, return the head verb\n", "    if phrase_type == 'VP': \n", "        return tuple(head_candidates)\n", "\n", "    # go head-hunting!\n", "    heads = []\n", "\n", "    for word in head_candidates:\n", "\n", "        # gather the word's subphrase (+ phrase_atom if otype is phrase) relations\n", "        word_phrases = list(L.u(word, 'subphrase'))\n", "        word_phrases += list(L.u(word, 'phrase_atom')) if (F.otype.v(phrase) == 'phrase') else list()\n", "        word_relas = set(F.rela.v(phr) for phr in word_phrases) or {'NA'}\n", "\n", "        # check (sub)phrase relations for independency\n", "        if word_relas - {'NA', 'par', 'Para'}: \n", "            continue\n", "\n", "        # check parallel relations for independency\n", "        elif word_relas & {'par', 'Para'} and mother_is_head(word_phrases, heads):\n", "            this_head = find_quantified(word) or find_attributed(word) or word\n", "            heads.append(this_head)\n", "\n", "        # save all others as heads, check for quantifiers first\n", "        elif word_relas == {'NA'}:\n", "            this_head = find_quantified(word) or find_attributed(word) or word\n", "            heads.append(this_head)\n", "\n", "    return tuple(sorted(set(heads)))\n", "\n", "def mother_is_head(word_phrases, previous_heads):\n", "\n", "    '''\n", "    Test and validate parallel relationships for independency.\n", "    Must gather the mother for each relation and check whether \n", "    the mother contains a head word. \n", "\n", "    --input--\n", "    * list of phrase nodes for a given word (includes subphrases)\n", "    * list of previously approved heads\n", "\n", "    --output--\n", "    boolean\n", "    '''\n", "\n", "    # get word's enclosing phrases that are parallel\n", "    parallel_phrases = [ph for ph in word_phrases if F.rela.v(ph) in {'par', 'Para'}]\n", "    # get the mother for the parallel phrases\n", "    parallel_mothers = [E.mother.f(ph)[0] for ph in parallel_phrases] \n", "    # get mothers' words, by mother\n", "    parallel_mom_words = [set(L.d(mom, 'word')) for mom in parallel_mothers]\n", "    # test for head in each mother\n", "    test_mothers = [bool(phrs_words & set(previous_heads)) for phrs_words in parallel_mom_words] \n", "\n", "    return all(test_mothers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How many subphrases with a parallel relation to a validated head consist of more than one word?\n", "\n", "We take the first head element for every noun phrase and check its parallel elements." 
] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "length: 1\n", "\t 3346\n", "length: 2\n", "\t 686\n", "length: 3\n", "\t 86\n", "length: 4\n", "\t 24\n", "length: 9\n", "\t 10\n", "length: 5\n", "\t 4\n", "length: 6\n", "\t 3\n" ] } ], "source": [ "par_word_count = collections.Counter()\n", "par_word_list = collections.defaultdict(list)\n", "\n", "for np in F.typ.s('NP'):\n", "\n", "    heads = OLD_get_heads(np)\n", "\n", "    if not heads:\n", "        continue\n", "\n", "    the_head = heads[0]\n", "\n", "    if not L.u(the_head, 'subphrase'):\n", "        continue\n", "\n", "    head_smallest_sp = sorted(sp for sp in L.u(the_head, 'subphrase'))[0]\n", "\n", "    par_daughter = [d for d in E.mother.t(head_smallest_sp) if F.rela.v(d) == 'par']\n", "\n", "    for pd in par_daughter:\n", "\n", "        # count the words of this parallel daughter\n", "        word_length = len(L.d(pd, 'word'))\n", "\n", "        par_word_count[word_length] += 1\n", "        par_word_list[word_length].append((the_head, head_smallest_sp))\n", "\n", "for w_count, count in par_word_count.items():\n", "    print('length:', w_count)\n", "    print('\\t', count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see some of the larger cases..." ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [], "source": [ "# B.show(par_word_list[6]) # uncomment me" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1409908 par אֹצְרֹ֥ות מַאֲכָ֖ל וְשֶׁ֥מֶן וָיָֽיִן׃ \n" ] } ], "source": [ "ex_subphrase = par_word_list[6][1][1]\n", "\n", "for daughter in [d for d in E.mother.t(ex_subphrase) if F.rela.v(d) == 'par']:\n", "\n", "    print(daughter, F.rela.v(daughter), T.text(L.d(daughter, 'word')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now some examples of 2 word lengths..." 
] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [], "source": [ "#B.show(par_word_list[2][:5]) # uncomment me" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These examples raise an important possibility. If we take the first word labeled \"subs\" (substantive) within the parallel, will that give us the coordinate head?\n", "\n", "Below is an example from Genesis 5:3 which shows a potential pitfall of the option 2 approach, and even of the current approach." ] }, { "cell_type": "code", "execution_count": 238, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "652845 שְׁלֹשִׁ֤ים וּמְאַת֙ שָׁנָ֔ה \n", "subphrase \ttext \trelation \tmother\n", "שְׁלֹשִׁ֤ים 1301065 NA ()\n", "מְאַת֙ שָׁנָ֔ה 1301068 par (1301065,)\n", "מְאַת֙ 1301066 NA ()\n", "שָׁנָ֔ה 1301067 rec (2154,)\n" ] } ], "source": [ "example = L.d(T.nodeFromSection(('Genesis', 5, 3)), 'phrase')[3]\n", "\n", "print(example, T.text(L.d(example,'word')))\n", "show_subphrases(example)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The example above illustrates the shortcoming of even the current method of separating quantifiers and quantifieds, as seen in this result:" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'שְׁלֹשִׁ֤ים |שָׁנָ֔ה '" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'|'.join(T.text([h]) for h in get_heads(example))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The algorithm retrieves both \"thirty\" and \"year,\" even though the only head in this case is \"year\". This is a shortcoming of the quantifier function, which in this case has not detected a complex quantifier that is formed with a parallel relation.\n", "\n", "The quantifier algorithm should have passed שלשים along to another test before validating it as a head. 
That is, it should look for this case of a complex quantifier. This is actually another good reason to change the parallels finder to approach 2, so that parallels are processed at the head level rather than disconnected from it. In this setup, the algorithm will gather all parallels to the head. If the syntactic head is a quantifier and has no quantified noun, then the algorithm will look further at the parallel relationship to see if it is also a quantifier. If it is, then it will look to find that quantifier's substantive and return it instead. This is a complex recursive process that will have to be coded." ] }, { "cell_type": "code", "execution_count": 157, "metadata": {}, "outputs": [], "source": [ "#B.show(par_word_list[3][:15]) # uncomment me" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Are there cases where there are multiple coordinate relations with a single subphrase?" ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 160, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_multiple_coor = '''\n", "\n", "sp1:subphrase\n", "sp2:subphrase rela=par\n", "sp3:subphrase rela=par\n", "\n", "sp2 -mother> sp1\n", "sp3 -mother> sp1\n", "\n", "sp2 # sp3\n", "'''\n", "\n", "test_multiple_coor = sorted(S.search(test_multiple_coor))\n", "\n", "len(test_multiple_coor)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No, it does not happen. Thus, coordinate relations are chained to each other, not multiplied to a single mother." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Conclusions\n", "This inquiry sparked the one above it about the subphrase by subphrase approach. We have decided to table this method for now." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }