Converted from a Word document
There is growing interest in intangible cultural heritage for narrative generation (Peinado and Gervás, 2006), yet with no test collections of tales or myths available, the computational modeling of narratives is still in its infancy (Finlayson et al., 2014). On the other hand, folk narratives are a global phenomenon, amply documented by fieldwork. These on the one hand pin down the major tale types as canonical motif sequences (Uther, 2004), so that storyline variants can be grouped under such types as document classes. Also, there exist lists of indexing criteria, which can be used to describe the content of such narratives on different conceptual levels (Thompson, 1955–1958). Therefore, methods that can accommodate nested compositional structure are required to index such material.
At the same time, information societies increasingly exploit digitized content for knowledge discovery, the scalability of data often leading to unforeseen opportunities for data analytics (Virshup et al., 2013). As there are apparent structural similarities between the recombinative transmission mechanisms of hereditary material in biology vs. cultural memories inherent, e.g., in texts, departing from earlier work on narrative genomics (Darányi et al., 2012; Ofek et al., 2013), and by calling in significant theoretical and development progress in analogical information retrieval and inference in the biomedical domain (Cohen et al., 2012; 2014), here we report on a new integrated methodology for narrative analysis. The idea goes back to literature-based discovery (Swanson, 1986; Cohen et al., 2010), and implements Propp’s formalism to analyze Russian fairy tales as representations of sets of concept-relation-concept triplets, or predications, in high-dimensional space.
Predication-based semantic trawling (PBST) is designed to both aggregate material conformant with specific semantic markup in a knowledge base and to improve the robustness of the same knowledge base by a feedback loop. The idea goes back to the metaphor of a fishing net with variable mesh size, implemented as (1) a knowledge base with a learning component; (2) storing semantically marked-up source material as a test collection, where markup regulates content granularity, i.e., the mesh size; (3) using the knowledge base for information filtering (Brin, 1998; Agichtein and Gravano, 2000) by trawling external data; (4) iteratively expanding the test collection by trawling results; and (5) periodically analyzing incoming data with the results fed back to the learning base.
Our new methodology is designed with scalability, quality, and automation in mind. To this end, we consider the respective integration of available technological steps into a single processing workflow as the missing link to get from little to big data in folk narrative analysis (Malec et al., 2014). These technological steps come from a combination of biomedical text analysis with the Proppian analysis of fairy tales.
In biomedical text analysis, Predication-based Semantic Indexing (PSI) provides the means to efficiently search across tens of millions of concept-relation-concept triplets, known as semantic predications, extracted from the biomedical literature using a Natural Language Processing (NLP) system called SemRep (Rindflesch and Fiszman, 2003). SemRep uses the UMLS
1 and MetaMap
2 to map relevant expressions from free text to concepts in a controlled vocabulary, and extracts relationships between these concepts using underspecified syntactic parsing, a set of indicator rules, and constraints present in the UMLS semantic network. PSI derives high-dimensional vector representations of concepts from the predications they occur in, effectively circumventing the combinatorial explosion of possible pathways between concepts by converting the task of traversing individual predications into the task of measuring the similarity between composite concept vectors. Consequently, search time for single, double, or triple predicate paths is identical once the relevant concept vectors have been constructed. This method can also detect double and triple predicate pathways connecting example pairs of therapeutically related drugs and diseases, and use these inferred pathways to guide search for treatments for other diseases. Further, PSI has been used to mediate semantic search by utilizing high-dimensional vector representations to infer the nature of the relationship between query concepts and other concepts in relevant documents. Inference is accomplished in high-dimensional space using Expansion-by-Analogy, a novel analogical approach to pseudo-relevance feedback, in which the relationships between query concepts and other concepts in documents they occur in guide the query expansion process. The semantic vector–based approaches developed show improvements in performance over a baseline bag-of-concepts model, and these are most pronounced on queries that are not conducive to keyword-based search (Cohen et al., 2014). This approach can be used to create predication-based semantic space for folk narratives.
V. J. Propp’s theory that the canonical form of Russian fairy tales is a compulsory sequence of actions—called
functions and selected from a list of 31 typical activities performed by typical actors—was based on a limited sample of cca 50 fairy tales from the Afanas’ev collection, itself comprising cca 600 stories, selected and compiled in the 19th century (Propp, 1968). Whereas the in-principle applicability of the scheme, with or without modifications, has been extensively debated ever since, researchers have started to look at the reproduction of Propp’s conclusions only recently (Bod et al., 2012). The scheme lends itself to semantic markup (Malec, 2010; Lendvai et al., 2010), with subject-verb-object (SVO) triples underlying Proppian functions suitable for predicate encoding of tale content, an observation first publicly observed among Western readers by Rumelhart (1975).
The following are typical examples of predication from the biomedical informatics domain:
Concept_1
Relation
Concept_2
Mammalian Oviducts LOCATION_OF Sterility
Thymol turbidity test DIAGNOSES Disease
Epididymis LOCATION_OF Obstruction
In comparison, predicates based on Russian fairy tales look like the following:
Concept_1
Relation
Concept_2
Baba Yaga IS_A donor
Golden apple IS_A gift
Baba Yaga LIVES_IN hut on chicken legs
Donor GIVES gift (direct object)
Donor GIVES_TO protagonist (indirect object)
In the Proppian sample corpus, e.g., Baba Yaga gives a magic apple to Ivan Simpleton, who therefore is the protagonist,
3 exemplifying the ‘donor’ function. The following tale segment helps to identify this function (Afanas’ev 96: ‘Morozko/Jack Frost’):
The poor little thing remained there shivering and softly repeating her prayers.
Jack Frost came leaping and jumping and casting glances at the lovely maiden.
‘Maiden, maiden, I am Jack Frost the Ruby-nosed!’ he said.
‘Welcome, Jack Frost! God must have sent you to save my sinful soul.’
Jack Frost was about to crack her body and freeze her to death, but he was touched by her wise words, pitied her, and tossed her a fur coat.
4
Here the ‘maiden’, a stepdaughter, is ordered by her stepmother to be left out to the elements in the forest, a plot element representative of Cinderella-like tales (ATU Type 510A) called ‘mat’ padcheritsa’ or ‘Zolushka’ tales in Russian folklore tradition. Jack Frost is clearly the donor. The verb ‘tossed’ is mapped to the ‘GIVES_TO’ relation for its predicate
.
Steps in the processing workflow are as follows:
• Apply PoS tagging,
5 anaphora resolution,
6 and semantic role labeling with SemLink
7 (Palmer et al., 2010).
• Normalize concepts and predicates and identify triplets (Rusu et al., 2007).
* Decompose tales sentence by sentence, noting context with respect to Proppian function sequence.
* Shallow parsing of texts using Stanford parser or OpenNLP to identify SVO.
* Extract triples.
• Situate these normalized concept and predicate triplets as metadata within the marked-up corpus as the Proppian SemRep (PSR) output.
* Through a kind of hermeneutic circle, create an analogy of the UMLS in biomedical domain, except with domain knowledge particular to folk narratives, e.g., a controlled vocabulary for predicates (GIVES_TO, ACCEPTS, PUNISHES, etc.), concepts (hero).
* Index predications into PSI-space and index corpus with permuted random indexing (PRI), then visualize results by Epiphanet (Schvaneveldt, 1990) and the query/d3.js interface.
* Extract triples from the Morphology itself to create a knowledge base for purposes of reasoning about the Russian fairy tale domain.
The PSR tool allows for the sequential encoding of marked-up text by circular holographic reduced representation (De Vine and Bruza, 2010), the binary spatter code (Kanerva, 1996), or PRI (Sahlgren et al., 2008) to map them in predication space. For this, different open-source software components are available, prominently the SemanticVectors package.
8 The output of PSR enables the analogical retrieval of predicates, finding missing pieces of evidence to reason about dramatic personae and the plot logic of magic tales.
The core of the tale set for the initial knowledge base are a subset of more than 45 tales in English translation to reflect the subset of Afanas’ev’s tales in Propp’s schema in appendix III. The second sweep will cover a larger subset of the Afanas’ev collection, namely that which Propp himself stated at the outset as his concern
9 before the semantic trawling of web resources may commence.
We have presented here a very rough model and a proof of concept. There will likely be room to address some aspects of the complex issue of the dynamic learning of narrative macro-structures, which tend to be distinctive of particular traditions and populations (Colby, 1973). Finally, our contribution ought to be seen in the light as ‘bootstrapping’ an artificial system to facilitate analogical reasoning about a limited subset of themes, structures, and encoding devices of traditional narrative.
Notes
1. http://www.nlm.nih.gov/research/umls/.
2. http://metamap.nlm.nih.gov/.
3. This is a special case where we need an indirect object, Ivan Simpleton.
4. https://github.com/kingfish777/ProppModel/blob/master/000_096_Morozko.xml.
5. http://nlp.stanford.edu/software/lex-parser.shtml.
6. http://cswww.essex.ac.uk/Research/nle/GuiTAR/gtarNew.html.
7. http://verbs.colorado.edu/semlink/. Semlink can create mappings between PropBank, VerbNet, WordNet, and FrameNet.
8. https://code.google.com/p/semanticvectors/.
9. Tales #93-270 from Afanas’ev, inclusive of the tales already mentioned.