Converted from an OASIS Open Document
Aligning different versions of the same work is both a computational and a philological challenge. In particular, the collation of witnesses of an ancient or medieval text poses specific difficulties due to the coexistence of macro-structural and localised variants, including a large number of formal variants.
We present an experimental computer-assisted workflow for aligning several witnesses and classifyingvariants. Formal and substantive variants are examples of categories especially relevant for languages which are unstable in their graphic system, as are medieval languages. The case studies are in Old French, and, marginally, Old Spanish.
The distinction between formal and substantive variants enables to treat them separately. Stemmatology, for instance, will be mostly interested in the former (even if this has been challenged in Andrews, 2016), while, for linguistic analysis the latter are needed. In automatic collation, based on full transcription of the texts to be compared, the formal variation is generally preserved, but temporarily nullified by means of normalisation or fuzzy match: this enables an accurate alignment of the texts and at the same time the preservation of the original forms.
Medieval texts, especially in vernacular, often exhibit important variation. At the phrases or words levels, syntactic or graphic variations account for diachronic and diatopic differences, varying scribal practices and the plurality of graphematic standards. This makes it difficult to align sequences between texts, when they have very few letters in common, e.g.,
Cait del fuere |
Chiet dou fuerre |
Kiet du feurre (‘[The sword] falls of the scabbard’).
Difficulties due to spelling or flexional variation only add up to already existing variations in word order or substance. Consider the following example taken from Chrétien de Troye's
Chevalier au lion (Meyer, 2006, v. 3701):
HLi frans li dolz ou ert il donques
PLi frans li dous ou estoit donques
VLi franz li doz ou ert il donques
FLi frans li dols ou ert il donques
GLi biaux li preuz ou estoit donques
ALi preus li frans u est il donques
SLi preus li frans u ert il donques
RLi frans li dols u ert il donques
MLi frans li preus ou est il donques
Spelling (e.g.,
dolz,
dous,
doz,
dols) and flexional variants (
est,
ert,
estoit) go along with substitutions (
dous |
preus or
biaux |
frans), additions/deletions (
il), or permutations (
preuz). In such a case, clearing out spelling and flexional variation might help in resolving the other difficulties.
This paper offers a new approach to the normalisation task made possible by the developments in the field of NLP and the resources now available for medieval languages, following the steps described in fig. 1.
The initial step is the acquisition of the text, from the digital image, done by a combination of manual transcription (for producing ground truth), automated handwritten text recognition, and post-correction. The raw text thus obtained is then structured and stored in an XML/TEI based format. All these tasks are performed before the normalisation step, here represented by lemmatization and linguistic annotation, done with the help of neural network-based taggers/lemmatizers.
Traditionally, normalisation consists of the preparation of the texts for alignment and might imply lowercasing, removing punctuation or editorial markup, as well as the temporary removal of formal features (Silva and Love, 1969 ; Robinson, 1989). Our proposal is to move to an automatic normalisation performed using NLP tools. Each token (i.e. word) is annotated with linguistic information such as part of speech, lemma and detailed morphological information. This kind of normalisation is only possible when suitable resources are available. For Old Spanish, Freeling (Padró Stanilovsky, 2012) provides a specific module (Boleda, 2011; Porta et al., 2013). For Old French, we used the data provided by the
Geste corpus (Camps et al., 2016), annotated with lemmas, as well as POS and morph tags according to the Cattex scheme (Prevost et al., 2013). With this data, we trained a neural tagger/lemmatizer suitable for variation-rich languages (Kestemont et al., 2017 ; Manjavacas et al., 2019). On the test set, accuracy reached 94.5 and 95% for lemmatization and POS-tagging, and was in the range 94-98.5% for different morphological features.
After normalisation, the texts enriched with linguistic information can be used to perform the alignment. Variation in structure, order or content in medieval texts is favoured by the existence of ‘active textual transmission’ (Vàrvaro, 1970) and by processes of rewriting, prosification/versification, etc. Changes in the order of the structural entities (verses, paragraphs, etc.) are also common. In order to collate these displaced entities, a phase of macro-structural alignment might be needed. This process can be done by a combination of direct expertise and tools conceived for detecting paraphrase, text reuse or computing similarities (Büchler et al., 2014; Jänicke and Wrisley, 2018).
The very collation is then made by using the collation program CollateX (Dekker et al., 2011 and 2015) in its Python version. CollateX uses multiple alignment algorithms, suitable for the comparison of more than two witnesses (Spadini, 2017); its modular structure, based on the Gothenburg model, enables the user to intervene on each module separately and to add new ones.
All these software bricks can be integrated in a more complex pipeline up to the the final output. The modular structure of CollateX enables us to adjust the alignment and the visualization phases, in order to take into account the linguistic annotations added to each token. The alignment is performed directly on the annotation, used as a normalised form. In the creation of the output, some rules are added to compare the original forms with the annotation and to assign a category to the variant. For example, the category 'formal variant' is assigned to aligned tokens which have the same annotations but different original forms, such as:
mielz (pos: adverb; lemma:
mieus),
miels (pos: adverb; lemma:
mieus),
miaus (pos: adverb; lemma:
mieus).
Additional rules can be used for classifying variants into finer-grained categories, using linguistic annotation (fig. 2).
This paper presents some early results of an ongoing research on automatic collation and categorization of variants. Performing normalization using NLP tools not only speeds up the task, but also makes the identification of fine-grained categories possible. The case studies show the strong and weak points of this proposal and of the technical solutions for its implementation. Eventually, this research forces us to reflect upon the importance of having software components which are open and modular, in order to improve them and to include them in computational pipelines.