Digital Edition And Linguistic Database: A Fully Lemmatised And Searchable Model

Digital Edition And Linguistic Database: A Fully Lemmatised And Searchable Model Morcos Hannah Jane King's College London, United Kingdom hannah.j.morcos@kcl.ac.uk 2019-05-31T08:03:00Z Name, Institution

Street City Country Name

Converted from a Word document

DHConvalidator Paper Late breaking poster Digital Edition Lemmatisation Linguistics Manuscripts medieval studies scholarly editing digital textualities and hypertext linguistics English manuscripts description and representation

This poster reports on a fruitful collaboration between specialists in the fields of medieval French literature, lexicography, and digital humanities. The outcome will be the first fully lemmatized digital edition of a medieval French text, with a search page directly linked to the Dictionnaire Étymologique de l'Ancien Français ( DEAF).

As part of the ERC Advanced Grant project The Values of French Language and Literature in the European Middle Ages, we are editing the two most important manuscript copies of the earliest universal chronicle in French, the Histoire ancienne jusqu’à César. In addition to being the first edition of the whole text, it will provide an extensive corpus (over 350,000 words) of lemmatized and searchable linguistic data. We hope it will complement existing medieval French databases, such as the impressive Base de Français Médiéval (BFM) (Guillot-Barbance et al., 2017), which while largely based on edited texts is gradually increasing its content from manuscript sources.

The two manuscripts we are editing were created in regions where French was not a native language (13 th-century Acre and 14 th-century Naples). In order to analyse their philological as well as linguistic complexities, the edition design engages with the singularity of each witness. Following TEI guidelines, the XML transcription files are structured according to the presentation of the text on the manuscript page. Each manuscript is divided into multiple files saved on cloud storage, facilitating the distribution of labour. The semi-diplomatic transcription replicates scribal word division and medieval punctuation. The addition of a custom editorial shorthand renders the edited version, which has modern punctuation, diacritics, regularized word division, and minimal editorial corrections.

Thanks to a collaboration with colleagues at the DEAF, we are lemmatizing the edited text using a custom-made digital tool called Lemming. A ‘key word in context’ (KWIC) index generated from the edited text is uploaded to Lemming, which operates on the fly. Lemming enables the attribution of the same lemma to numerous keywords, the annotation of ambiguous cases, and multiple users to work at the same time. When users identify inaccuracies in the editorial work (e.g. missing accents, incorrect word division or letter forms, etc.), corrections are made to the TEI source files and the content is then re-uploaded to Lemming. The deciphering of changed or new keywords/contexts posed a particular challenge. In response to this issue, a new interface was created where “unverified imports” can be reviewed individually or en bloc, and new lemmas attributed if necessary. This conflict resolution has had significant benefits for our collaborative editorial workflow, making it possible to work in parallel on editing and on lemmatizing, and enhancing the accuracy of both.

Figure 1. Faceted search page

Over the last six months, the primary development has been the integration of the output KWIC from Lemming into the edition’s search page (see Fig. 1). Using a faceted search, users can filter results by lemma, form, prose/verse, direct/indirect speech, manuscript, and narrative section. Most notably, it is possible to sort the results according to the context immediately following the keyword. This offers us an important opportunity for the study of syntax, in particular the syntax of Old French negation and the conditions for the expression of pronominal/nominal subjects in subordinate clauses. For example, we are able to explore occurrences of the conjunction “que” immediately followed by the adverb of negation “ne”: “Quant Phutiphares oi ces paroles, sachiés que ne li pleurent mie” (Fr20125, § 281.1). In such cases, sentential negation is stated in a way that is not possible in later French. Whereas middle and modern French would need the overt and compulsory expression of the subject between “que” and “ne”, old and very old French do not (see Schøsler and Völker, 2014; and Ledgeway, forthcoming). This is just one of the exciting opportunities the lemmatized search results offer for the study of specific linguistic features, and in this case influences not only our understanding of the manuscripts that we are editing but, more importantly, of the history of the French language.

The first iteration of the search page will be published on the public site in summer 2019. The data will therefore be available for other researchers working on linguistics, lexicography, discourse analysis and literary studies, before and after the end of the project.

Figure 2. Editorial workflow

Underlying these developments is our workflow (see Fig. 2), which features custom-designed scripts that automatically aggregate the TEI source files, convert the editorial shorthand, tokenize the words (whilst maintaining the XML structure), and generate the KWIC index for Lemming as well as preview-able outputs on the project’s staging site. The fully automated and frequent conversion of the fragmented TEI sources takes place every two hours, enabling the verification of the final output as the end-users will see it. Furthermore, a custom-made printable proof-reading page (see Fig. 3) further facilitates incremental publication and pre-emptive problem-solving. By the project’s completion, the aggregated and converted TEI content will be available open access.

Figure 3. Printable proof-reading page

Bibliography Guillot-Barbance, C., Heiden, S. and Lavrentiev, A. (2017). Base de français médiéval : une base de référence de sources médiévales ouverte et libre au service de la communauté scientifique, Diachroniques, 7: 168-84. Ledgeway, A. (forthcoming). Word Order in the Histoire ancienne jusqu’à César. Secrets of Success. Or: How to preserve a Verb Second word order? (Oslo, 10-11 January 2019) Schøsler, L. and Völker, H. (2014). Intralinguistic and extralinguistic variation factors in Old French negation with ne-Ø, ne-mie, ne-pas and ne-point across different text types, Journal of French Language Studies, 24(1): 127-53. Morcos, H., Gaunt, S., Ventura, S., Rachetta, M. T. and Ravenhall, H. (eds). (2017-). The Histoire ancienne jusqu’à César . A digital edition. With technical support from P. Caton, G. Ferraro, M. Husar and G. Noël. Available at: https://tvof.ac.uk/histoire-ancienne (Accessed 30 May 2019).