Big Data and the Study of Allusion: an Exploration of Tesserae's Multitext Capability

Gawley, James O'Brien (University at Buffalo, United States of America; jamesgaw@buffalo.edu); Diddams, A. Caitlin (University at Buffalo, United States of America; acstaab@buffalo.edu). 2016-03-06T23:48:00Z. Maciej Eder, Pedagogical University in Krakow; Jan Rybicki, Jagiellonian University.

Institute of Polish Studies Pedagogical University ul. Podchorazych 2 30-084 Krakow, Poland maciej.eder@ijp-pan.krakow.pl


Short Paper. Keywords: intertextuality; literary studies; big data; classical studies; corpora and corpus activities; text analysis; digital humanities - pedagogy and curriculum; data mining / text mining. Language: English.

This paper aims to explore the methodology and the potential of the “multi-text tool,” an advanced feature of the Tesserae Project. This tool is capable of identifying potentially interesting textual parallels between works, and collating all instances of these and similar phrases throughout the entire corpus of canonical Latin or Greek. As such it expands the researcher's ability to identify meaningful intertexts among Tesserae results, and makes it possible to quantify the literary influences which act upon a given work.

Tesserae is an online tool that aims to automatically identify allusions and more general forms of intertext between ancient authors. The program can identify specific intertexts between two works, as in previous research that brought to light new parallels between Lucan's Bellum Civile and Vergil's Aeneid. A standard Tesserae search identifies a possible intertext when at least two of the same words (regardless of inflection) appear within the same phrase in two different texts. This 'bigram' measurement of similarity has been demonstrated to capture roughly 67% of intertexts previously noted in scholarship. The multitext function begins with a standard two-text Tesserae comparison; all bigrams shared between the two texts are then compared to a corpus of additional works selected by the user.
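The two stages described above can be illustrated with a minimal sketch. It is not Tesserae's own code: the toy lemma table, the function names, and the treatment of each phrase as a whitespace-separated string are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical toy lemmatizer: maps inflected forms to a headword.
# A real search would rely on a full morphological dictionary.
LEMMAS = {
    "arma": "arma", "armis": "arma",
    "virum": "vir", "viri": "vir", "viro": "vir",
    "cano": "cano", "canit": "cano",
}

def lemma_bigrams(phrase):
    """Return the unordered lemma pairs that co-occur within one phrase."""
    lemmas = {LEMMAS.get(word, word) for word in phrase.lower().split()}
    return {frozenset(pair) for pair in combinations(sorted(lemmas), 2)}

def text_bigrams(phrases):
    """Collect the lemma pairs found in any phrase of a text."""
    found = set()
    for phrase in phrases:
        found |= lemma_bigrams(phrase)
    return found

def shared_bigrams(target_phrases, source_phrases):
    """Stage 1: lemma pairs present in both texts (the two-text comparison)."""
    return text_bigrams(target_phrases) & text_bigrams(source_phrases)

def multitext_hits(bigrams, corpus):
    """Stage 2: for each shared pair, list the other works that also contain it."""
    hits = {bigram: [] for bigram in bigrams}
    for work, phrases in corpus.items():
        present = text_bigrams(phrases)
        for bigram in bigrams:
            if bigram in present:
                hits[bigram].append(work)
    return hits

shared = shared_bigrams(["arma virum cano"], ["armis viri"])
hits = multitext_hits(shared, {"work_a": ["viro armis canit"], "work_b": ["cano"]})
```

In this toy run the pair arma/vir is shared by the two compared texts despite the different inflections, and the second stage reports that it also occurs in the hypothetical `work_a` but not in `work_b`.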

Although the first aim of most Tesserae users is to discover previously unnoted allusions, not all bigram similarities discovered by a Tesserae search are allusions. The program's scoring algorithm attempts to sort the meaningful intertexts sought by the researcher from undesired, coincidental overlap by considering the rarity of the shared words and their proximity to one another. This method has been shown to be at least partially effective, yet it does not fully predict the assessment of the expert researcher. Because Tesserae results can number in the tens of thousands, further means of identifying desirable intertexts are necessary, particularly when the number of results is expanded by new search features such as semantic matching, introduced in October 2015. Tesserae's multitext search can be used to trace the history of a phrase throughout the corpus, and thereby eliminate oft-repeated bigrams. This method assumes that phrases appearing in many previous works are less likely to represent allusions.
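To make the idea of scoring by rarity and proximity concrete, the sketch below rewards matches whose shared words are infrequent and close together. The exact formula, the function name, and the input conventions (relative word frequencies, distances counted in words) are assumptions chosen for illustration and should not be read as Tesserae's implementation.

```python
import math

def score_match(target_freqs, source_freqs, target_distance, source_distance):
    """Illustrative score: higher for rarer shared words and shorter spans.

    target_freqs / source_freqs: relative frequencies (0 < f <= 1) of the
        matched words in each text (assumed inputs).
    target_distance / source_distance: number of words spanned by the
        matched words in each phrase, at least 1 (assumed inputs).
    """
    # Rare words contribute large inverse frequencies.
    rarity = sum(1.0 / f for f in target_freqs) + sum(1.0 / f for f in source_freqs)
    # Wide spans in either text lower the score.
    span = target_distance + source_distance
    return math.log(rarity / span)

rare_close = score_match([0.001, 0.002], [0.001, 0.002], 2, 2)
common_close = score_match([0.05, 0.04], [0.05, 0.04], 2, 2)
rare_far = score_match([0.001, 0.002], [0.001, 0.002], 6, 6)
```

With these invented inputs, the match built from rare, adjacent words scores above both the match built from common words and the match whose words are widely separated, which is the ordering the paragraph describes.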

In addition to identifying new allusions between two works, intertextual scholars often wish to measure the rate of connection between them. The level of active engagement between authors suggests the literary influence of a given work, and we propose a way to quantify this engagement. Using the multitext tool, the researcher can eliminate all widely occurring textual parallels between potentially connected works and retain only unique results, then consider the rate of unique intertextuality against a baseline figure. Unlike the micro-level analysis of individual textual parallels, this macro-level analysis allows the reader to examine the intertextuality between works as a whole.
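The macro-level measurement can be sketched as follows. The text above does not specify how the rate or the baseline is computed, so normalizing by 1,000 lines of the target text, the `max_other_works` cutoff, and all function names are assumptions made for this example.

```python
def unique_parallels(hit_counts, max_other_works=0):
    """Keep parallels whose bigram occurs in at most `max_other_works` other works.

    hit_counts: dict mapping a parallel's identifier to the number of
        additional works in which the multitext search found its bigram.
    """
    return [parallel for parallel, n in hit_counts.items() if n <= max_other_works]

def unique_rate(hit_counts, target_lines, max_other_works=0):
    """Unique parallels per 1,000 lines of the target text (assumed normalization)."""
    return 1000 * len(unique_parallels(hit_counts, max_other_works)) / target_lines

def engagement_ratio(rate, baseline_rate):
    """Compare an observed rate with a baseline rate chosen by the researcher."""
    return rate / baseline_rate

# Invented toy data: four parallels and their multitext hit counts.
toy_hits = {"p1": 0, "p2": 5, "p3": 0, "p4": 12}
rate = unique_rate(toy_hits, target_lines=500)
ratio = engagement_ratio(rate, baseline_rate=2.0)
```

A ratio above 1 would indicate more unique intertextuality than the baseline; how the baseline itself is obtained (for instance, from a pair of works with no expected connection) is left to the researcher.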

Our preliminary research shows that specific textual parallels with a large number of multitext results tend to indicate the use of generalized language rather than a meaningful communication between works. For example, Augustine’s language in De Doctrina Christiana bears measurable resemblance to the language of Quintilian, Cicero, and Tacitus. Yet scholars have long argued that Augustine’s primary influence was Cicero. Our multitext results show that Augustine’s engagement with Cicero includes a large percentage of “unique” intertexts, seldom picked up by other authors. His connections to other authors were mostly composed of “general” intertexts, which appear to consist of standard language used by almost all Latin writers engaged in the discussion of rhetoric. The same method has previously been used to examine Claudian's differential level of interaction with the various authors of Latin epic. The examples of Claudian and Augustine demonstrate the efficacy of unique intertextual connections as a measurement of relative literary engagement.
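One way to express this comparison across source authors is to compute, for each author, the share of parallels that count as “unique.” The sketch below is illustrative only: the threshold separating “unique” from “general” parallels, the function name, and the toy data (which are invented and are not results from this study) are all assumptions.

```python
from collections import Counter

def unique_share(parallels, threshold=3):
    """Return, per source author, the fraction of parallels labeled unique.

    parallels: iterable of (source_author, multitext_hit_count) pairs.
    threshold: a parallel is treated as unique when its bigram appears in
        fewer than `threshold` other works (an assumed cutoff).
    """
    totals, unique = Counter(), Counter()
    for author, hit_count in parallels:
        totals[author] += 1
        if hit_count < threshold:
            unique[author] += 1
    return {author: unique[author] / totals[author] for author in totals}

# Invented toy data with placeholder author names.
toy_parallels = [
    ("author_a", 0), ("author_a", 1), ("author_a", 9), ("author_a", 2),
    ("author_b", 8), ("author_b", 15), ("author_b", 0), ("author_b", 20),
]
shares = unique_share(toy_parallels)
```

In this toy data `author_a` has a unique share of 0.75 and `author_b` a share of 0.25, the kind of contrast the paragraph describes between a primary influence and authors connected mainly through shared stock language.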

Although discarding oft-repeated language effectively removes meaningless intertexts when considering works at the macro scale, the scholar in pursuit of specific, meaningful intertexts should not discard these parallels. Sometimes a phrase with a large number of multi-text results is actually more significant than one with fewer. Certain phrases are quoted and alluded to so often that they become “viral.” Such cases tend to arise where a Tesserae match is ranked particularly high, but it ultimately remains the responsibility of the reader to sort meaningful intertexts from coincidental ones.

The multi-text tool expands the potential of big-data research in the field of Classics, allowing for quantitative stylistic analysis at a speed and a level of specificity previously impossible. It is an essential addition to the basic matching of the Tesserae Project, and we hope to see it become an element of digital humanities programs in various academic curricula. From specific analyses of sets of texts to more comprehensive explorations of style, the multi-text tool should be a fundamental resource in the study of allusion in Latin literature. In addition to demonstrating the efficacy of this tool, in this paper we explain how to use the results generated by a multi-text search to build an empirical measurement of authorial engagement, beginning with how to run the module on a personal computer and concluding with how to meaningfully analyze its results and tailor its parameters for individual projects.

Bibliography

Coffee, N. et al. (2012). The Tesserae Project: Intertextual Analysis of Latin Poetry. Literary and Linguistic Computing, 28(1): 221-28. doi: 10.1093/llc/fqs033.

Coffee, N. et al. Modeling the Interpretation of Literary Allusion with Machine Learning Techniques. Journal of Digital Humanities, 3(1).

Coffee, N. and Forstall, C. (forthcoming). Claudian’s Engagement with Lucan in his Historical and Mythological Hexameters. In Berlincourt, V., Galli-Milic, L. and Nelis, D. (eds), Lucain et Claudien face à face: une poésie politique entre épopée, histoire et panégyrique. Winter Verlag.

Conley, T. M. (1990). Rhetoric in the European Tradition. New York: Longman.

Murphy, J. J. (1960). Saint Augustine and the Debate about a Christian Rhetoric. In Enos, R. L. and Thompson, R. (ed.), The Rhetoric of St. Augustine of Hippo: De Doctrina Christiana and the Search for a Distinctly Christian Rhetoric. Waco: Baylor University Press, pp. 205-17.

Forstall, C. et al. (2015). Modeling the Scholars: Detecting Intertextuality through Enhanced Word-Level N-Gram Matching. Literary and Linguistic Computing, doi: 10.1093/llc/fqu014.

Gawley, J., Forstall, C. and Coffee, N. Evaluating the literary significance of text re-use in Latin poetry, Poster presented at Chicago Colloquium on Digital Humanities and Computer Science, University of Chicago, Chicago, IL. November, pp. 17-19.