<title type="main">Coping With The Complexity Of The TXM Platform Annotation Services With A Unified TEI Encoding Framework Heiden Serge ENS de Lyon, France slh@ens-lyon.fr 2019-04-26T16:12:36.814730624


Paper Long Paper digital text encoding annotation XML TEI textometry TXM digital hermeneutics corpus and text analysis text encoding and markup languages natural language processing standards and interoperability English computer science and informatics digital humanities (history theory and methodology)
Introduction

TXM is a software platform offering textual corpora analysis tools and services. It is delivered as a standard desktop application for Windows, Mac and Linux and as a web portal server application (Heiden, 2010).

TXM provides a consistent set of analysis tools combining qualitative (or close reading) tools such as word frequency lists, concordancing or text edition hypertextual navigation with synthetic quantitative (or distant reading) tools such as factorial analysis, clustering, keywords or co-occurrence statistical analysis.

To work on texts, the platform first imports the corpus sources to create a rich XML-TEI based internal pivot representation via the following general workflow:

a) first, the “base text” of each text is established: this operation implements “digital philology” principles and consists of decoding the information in the various formats of the source documents, primarily to decide where the limits of each text, the boundaries of any internal textual structures and the words of the text are.

To do this, TXM can analyze and represent three main types of corpora:

corpora of written texts, possibly including paginated editions with side-by-side display of the transcription and the facsimile images;

corpora of transcribed recordings, possibly time-synchronized at the word or speech-turn level with the audio or video source to provide playback;

parallel multilingual corpora aligned at the level of textual structures such as sentences or paragraphs.

The result of this operation is represented in a pivot XML format especially designed for TXM, called “XML-TEI TXM”, which extends the standard encoding recommendations of the Text Encoding Initiative consortium (TEI Consortium, 2017);

b) then, natural language processing (NLP) tools are optionally applied to the base text to automatically add linguistic information such as sentence boundaries, grammatical categories (pos = part of speech) and lemmas of words, for example with TreeTagger (Schmid, 1994). As NLP tools generally do not take XML as input, the pivot representation is first converted to plain text for NLP processing, and the results are injected back into the XML-TEI TXM representation;

c) finally, a specialized representation of the texts is built into TXM for efficient execution of its different tools (by indexing for search engines, or by rendering in HTML5, CSS and JavaScript for navigation in text editions).
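The round trip described in step b) can be sketched as follows. This is a minimal illustration, not TXM's implementation: the element and attribute names (`w`, `id`, `pos`, `lemma`) are simplified assumptions rather than the actual XML-TEI TXM schema, and `toy_tagger` is a hypothetical stand-in for an external tool such as TreeTagger.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature pivot document (illustrative names only,
# not the actual XML-TEI TXM schema).
PIVOT = '<text><p><w id="w1">The</w> <w id="w2">laws</w> <w id="w3">rule</w></p></text>'


def toy_tagger(tokens):
    """Stand-in for an external NLP tool (assumption: it receives plain
    word forms and returns one (pos, lemma) pair per form, in order)."""
    lexicon = {"The": ("DT", "the"), "laws": ("NNS", "law"), "rule": ("VBP", "rule")}
    return [lexicon.get(t, ("UNK", t.lower())) for t in tokens]


def annotate(pivot_xml):
    root = ET.fromstring(pivot_xml)
    words = list(root.iter("w"))
    # 1. Convert to plain text: keep only the word forms, in document order.
    tokens = [w.text for w in words]
    # 2. Run the NLP tool on the plain tokens.
    tags = toy_tagger(tokens)
    # 3. Inject the results back into the XML, relying on token order.
    for w, (pos, lemma) in zip(words, tags):
        w.set("pos", pos)
        w.set("lemma", lemma)
    return root


root = annotate(PIVOT)
result = {w.get("id"): (w.get("pos"), w.get("lemma")) for w in root.iter("w")}
```

The sketch relies on the tool returning exactly one result per input token, which is what allows the results to be re-attached to the XML word elements by position.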

From a methodological point of view:

the XML tags of the initial XML-TEI TXM representation in a) can be seen as manual annotations added to the base text (or raw text), typically philologically edited with the help of specialized XML editors (like Oxygen XML Editor https://www.oxygenxml.com ) outside of TXM when the source is XML native, or as automatic annotations added by TXM when converting the sources from other digital formats (like TXT, DOC, etc.) into XML-TEI TXM;

the results of NLP tools processing in step b) can be seen as automatic annotations added to the initial XML-TEI TXM representation of texts built in step a);

all TXM tools can then be applied indiscriminately to all types of annotations through a unified textual corpus data model, regardless of their origin (automatic or manual).

Thus, so far TXM has implemented a traditional digital philology workflow combining an initial “text source encoding and annotation” step with a subsequent “application of analysis tools on annotated texts” step. The text analysis tools use text annotations (for example word pos or some internal textual structure) to offer their services and produce their results (for example the frequency index of all infinitive verbs found in a section). The workflow is unidirectional and must be run again in full if any annotation needs to be corrected. To add or correct annotations, the user has to edit the sources or the annotations outside of TXM. For example, word properties can be exported from the XML-TEI TXM representation to a tab-separated file, edited in a spreadsheet and injected back into the texts before re-import into TXM (see for example the tutorial based on TXM macros).
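The external correction loop described above (export word properties to a tab-separated file, edit them, inject them back) can be sketched as follows. This is an assumption-laden illustration and not the TXM macros themselves: the `w` element, the `id` and `pos` attributes and both helper functions are hypothetical, and the spreadsheet edit is simulated by a string replacement.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical simplified pivot document (not the real XML-TEI TXM schema).
PIVOT = '<text><w id="w1" pos="NN">vers.</w> <w id="w2" pos="VB">read</w></text>'


def export_properties(root, props):
    """Write one tab-separated row per word: id, form, then each property."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "form", *props])
    for w in root.iter("w"):
        writer.writerow([w.get("id"), w.text, *(w.get(p, "") for p in props)])
    return buf.getvalue()


def inject_properties(root, table, props):
    """Copy the (possibly edited) property values back onto the words."""
    by_id = {w.get("id"): w for w in root.iter("w")}
    for row in csv.DictReader(io.StringIO(table), delimiter="\t"):
        for p in props:
            by_id[row["id"]].set(p, row[p])


root = ET.fromstring(PIVOT)
table = export_properties(root, ["pos"])
# Simulate the manual spreadsheet edit of one cell.
edited = table.replace("w1\tvers.\tNN", "w1\tvers.\tn-ab")
inject_properties(root, edited, ["pos"])
```

Keying the rows on a word identifier, rather than on row order, keeps the injection step robust if the table is sorted or filtered in the spreadsheet.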

This paper introduces new services developed in TXM to annotate texts directly from within the results view of specific tools, for a better integration of philological and analytical work. Indeed, results views are good places to notice annotation errors or annotation needs, and to access what needs to be corrected or annotated.

Interactive annotation services in TXM

The three new annotation services concern both adding and correcting information, and all the annotations edited are meant for further exploitation by usual TXM tools.

Concordance based SyMoGIH entities annotation

The first service, developed in partnership with the LARHRA research laboratory in history http://larhra.ish-lyon.cnrs.fr , is based on the annotation of concordance pivots: any sequence of words composing the pivots can be annotated with any semantic category of the SyMoGIH historical knowledge base (Beretta, 2015). Pivots can also optionally be annotated with simple keywords or with key-value pairs, managed by TXM in a local repository. In this architecture, the SyMoGIH platform hosts the ontology of historic facts and knowledge, and TXM concordances provide the user interface to link identifiers of those data to text spans for further analysis.

As an illustration, figure 1 shows the annotation of the “Faculté de droit d’Aix” entity (of id CoAc13562) in unverified OCRed texts of the “Bulletin administratif de l'Instruction publique” corpus (see the Bibliothèque historique de l'éducation (BHE) project).

Figure 1. TXM screenshot of a Concordance of a “Faculté de droit d’Aix” word sequence pattern to annotate (top) and of browsing SyMoGIH semantic categories to find the CoAc13562 identifier to use for the annotation (bottom).

TXM internal management of those annotations is equivalent to a re-import of the current pivot representation with the new annotations. After re-import (after saving annotations) the new annotations are available for all TXM tools to work on like any original “annotation” of the texts (with internal textual structures and their properties).
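As a rough illustration of linking a knowledge base identifier to text spans, the sketch below finds every occurrence of a word sequence and records an annotation for it. The token list, the `find_sequences` helper and the dictionary layout are assumptions made for the example; only the entity identifier CoAc13562 and the word sequence come from the illustration above, and TXM's concordance query engine works differently.

```python
def find_sequences(tokens, pattern):
    """Return (start, end) index pairs of every occurrence of pattern."""
    n = len(pattern)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == pattern
    ]


# Invented sample sentence, tokenized on whitespace for simplicity.
tokens = "le doyen de la Faculté de droit d’Aix et la Faculté de droit d’Aix".split()
pattern = ["Faculté", "de", "droit", "d’Aix"]

# Each annotation links a text span to an identifier hosted elsewhere.
annotations = [
    {"start": start, "end": end, "ref": "CoAc13562"}
    for start, end in find_sequences(tokens, pattern)
]
```

Storing only the identifier in the annotation mirrors the division of labour described above: the knowledge base keeps the description of the entity, and the corpus keeps the link to the text span.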

Concordance based word properties annotation

The second service is based on the annotation of words of concordance pivots: a word present in the pivots of a concordance can be annotated with any property (in TXM, pivots of concordances can be composed of a sequence of words). The primary goal of that service is to annotate and correct the pos and lemma properties of words, but it can help to annotate any property at the single word level.

As an illustration, figure 2 shows the correction of the “pos” property of some “vers.” words used in biblical references in Hobbes's works lemmatized by MorphAdorner (Burns, 2013).

Figure 2. TXM screenshot of a Concordance to set the “pos” property to the “n-ab” value of two occurrences of the "vers." word, selected by their concordance line.

TXM internal management of those annotations is equivalent to a re-import of the current pivot representation with new annotations encoded in XML-TEI TXM at the word level.
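The principle of correcting a word property from selected concordance lines can be sketched as follows. The `Word` class, both functions and the sample tokens are hypothetical, and all tag values other than “n-ab” are placeholders chosen for the example; this is not TXM's concordance or annotation code.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    form: str
    props: dict = field(default_factory=dict)


def concordance(words, pivot_form, width=2):
    """Return (position, left context, pivot, right context) for each hit."""
    lines = []
    for i, w in enumerate(words):
        if w.form == pivot_form:
            left = " ".join(x.form for x in words[max(0, i - width):i])
            right = " ".join(x.form for x in words[i + 1:i + 1 + width])
            lines.append((i, left, w.form, right))
    return lines


def annotate_lines(words, lines, selected, prop, value):
    """Set prop=value on the pivot word of each selected concordance line."""
    for n in selected:
        position = lines[n][0]
        words[position].props[prop] = value


# Invented sample; the initial "vvz" values play the role of tagger errors.
corpus = [Word(f, {"pos": p}) for f, p in [
    ("chap.", "n-ab"), ("2", "crd"), ("vers.", "vvz"), ("10", "crd"),
    ("and", "cc"), ("vers.", "vvz"), ("12", "crd"), ("he", "pns"), ("vers.", "vvz"),
]]
lines = concordance(corpus, "vers.")
# The user selects the first two lines and leaves the third untouched.
annotate_lines(corpus, lines, selected=[0, 1], prop="pos", value="n-ab")
```

Because each concordance line keeps the position of its pivot in the corpus, a selection made in the results view can be mapped back to the exact word occurrences to modify.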

Full text URS annotation in text edition

The third annotation service is based on the manual annotation of sequences of words inside text editions with elements of a Unit-Relation-Schema (URS) annotation model (Widlöcher & Mathet, 2009). URS type annotations are designed to encode complex discourse entities such as co-reference chains in texts (Schnedecker et al., 2017).

As an illustration, figure 3 shows the annotation of the “ses loix” sequence of words with a URS unit of type MENTION, with its grammatical category set to the value “GN.POS” and its referent set to the value “les lois de la divinité”, in the first chapter of the 1755 edition of De l'esprit des lois by Montesquieu.

Figure 3. TXM screenshot displaying the first page of an edition of De l'esprit des lois highlighting in light yellow all URS units of type MENTION and in bold the unit currently selected (top window), and displaying the current unit properties value input form (bottom window): CATEGORY property at value “GN.POS”...

TXM import/export management services represent those annotations as XML-TEI stand-off annotations anchored to the word elements of the XML-TEI TXM representation of texts (Grobol et al., 2018).
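The idea of a stand-off annotation anchored to word elements can be sketched as follows: the unit is stored apart from the text and only points to the identifiers of its first and last words. The `Unit` class, the word identifiers, the `REF` property name and the token list are assumptions for illustration; the actual XML-TEI-URS encoding is described in Grobol et al. (2018).

```python
from dataclasses import dataclass


@dataclass
class Unit:
    unit_id: str
    unit_type: str
    first_word: str  # identifier of the first anchored word
    last_word: str   # identifier of the last anchored word
    features: dict


# Illustrative tokens with hypothetical identifiers.
words = [("w1", "la"), ("w2", "divinité"), ("w3", "a"), ("w4", "ses"), ("w5", "loix")]

# The annotation lives outside the text and refers to it by word id.
unit = Unit(
    unit_id="u1",
    unit_type="MENTION",
    first_word="w4",
    last_word="w5",
    features={"CATEGORY": "GN.POS", "REF": "les lois de la divinité"},
)


def resolve(unit, words):
    """Return the text span a stand-off unit points to."""
    ids = [wid for wid, _ in words]
    start, end = ids.index(unit.first_word), ids.index(unit.last_word)
    return " ".join(form for _, form in words[start:end + 1])


span_text = resolve(unit, words)
```

Keeping units separate from the text allows several, possibly overlapping, annotation layers to coexist without modifying the word-level markup they are anchored to.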

Discussion

By using a common XML-TEI TXM pivot representation for the internal management of corpora across all the annotation services, TXM unifies transcription, encoding and annotation activities in a single framework. In this framework, annotations can represent manual (user), semi-automatic (machine+user) or automatic (machine) interpretation results that are used further for analysis and interpretation work. The reflexive nature of the resulting text analysis workflow is schematized in figure 4. Texts are first digitized by OCR, transcribed or converted from digital formats. They are then possibly philologically corrected and established through manual XML-TEI encoding. They are then automatically processed by NLP tools while being imported into TXM to produce the TXM internal corpus model. Corpus analysis is then assisted by TXM tools applied to the corpus model. The pivot representation that gathers all annotations produced by annotation tools is shown as the node labeled « Pivot rep. » and the interpretation workflow itself is shown as a digital hermeneutic circle.

Figure 4. Digital hermeneutic circle integration into TXM.

Legend

blue box: manual annotation activity

black box: tool

red box: automatic annotation activity

green box: TXM corpus data model

purple disk: data representation

black arrow: activity

green arrow: annotation equivalence

Conclusion

Together, the new annotation services integrated into TXM build a comprehensive annotation-based platform for the analysis of digital text corpora. From an epistemological point of view, integrating the different annotation models and tools into the platform through TEI helps its users to better define and trace what comes from the source corpus they analyze and what comes from their own or from others' interpretation work.

This work was funded by the ANR and the DFG under grant numbers ANR-15-CE38-0008 (DEMOCRAT project) and ANR-14-FRAL-0006 (PaLaFra project).

Bibliography

Beretta, F. (2015). Publishing and sharing historical data on the semantic web: the SyMoGIH project – symogih.org. Presented at the Workshop: Semantic Web Applications in the Humanities. Retrieved from https://halshs.archives-ouvertes.fr/halshs-01136533.

Burns, P. R. (2013). MorphAdorner v2: A Java Library for the Morphological Adornment of English Language Texts. Evanston, IL: Northwestern University.

Grobol, L., Landragin, F. and Heiden, S. (2018). XML-TEI-URS: using a TEI format for annotated linguistic resources. CLARIN Annual Conference 2018. Pisa, Italy. https://hal.archives-ouvertes.fr/hal-01827563.

Heiden, S. (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In Otoguro, R., Ishikawa, K., Umemoto, H., Yoshimoto, K. and Harada, Y. (eds), 24th Pacific Asia Conference on Language, Information and Computation – PACLIC24. Sendai: Institute for Digital Enhancement of Cognitive Development, pp. 389-98. <https://halshs.archives-ouvertes.fr/halshs-00549764/en> (accessed 15 April 2019).

Schmid, H. (1994). Probabilistic Part-Of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing (Vol. 12).

Schnedecker, C., Glikman, J. and Landragin, F. (2017). Les chaînes de référence : annotation, application et questions théoriques. Langue française, (195), 5–16. https://doi.org/10.3917/lf.195.0005

TEI Consortium. (2017). TEI P5: Guidelines for Electronic Text Encoding and Interchange. TEI Consortium. Retrieved from http://www.tei-c.org/Guidelines/P5

Widlöcher, A. and Mathet, Y. (2009). La plate-forme Glozz: environnement d’annotation et d’exploration de corpus. In Actes de la 16e Conférence Traitement Automatique des Langues Naturelles (TALN’09), session posters (p. 10). Senlis, France. Retrieved from https://hal.archives-ouvertes.fr/hal-01011969