TXM is a software platform offering textual corpora analysis tools and services. It is delivered as a standard desktop application for Windows, Mac and Linux and as a web portal server application (Heiden, 2010).
TXM provides a consistent set of analysis tools combining qualitative (or close reading) tools, such as word frequency lists, concordancing or hypertext navigation of text editions, with synthetic quantitative (or distant reading) tools, such as factorial analysis, clustering, keywords or co-occurrence statistical analysis.
To work on texts, the platform first imports the corpus sources to create a rich XML-TEI based internal pivot representation via the following general workflow:
first the “base text” of each text is established: this operation implements “digital philology” principles and consists of decoding information in the various formats of the source documents, primarily to determine where the limits of each text, the boundaries of any internal textual structures and the words of the text are.
To do this, TXM can analyze and represent three main types of corpora:
The result of this operation is represented in a pivot XML format especially designed for TXM called “XML-TEI TXM” extending the standard encoding recommendations of the Text Encoding Initiative consortium (TEI Consortium, 2017);
From a methodological point of view:
Thus, so far TXM has implemented a traditional digital philology workflow that combines an initial “text source encoding and annotation” step with a subsequent “application of analysis tools on annotated texts” step. The text analysis tools use text annotations (for example the part of speech, or “pos”, of words, or some internal textual structure) to offer their services and produce their results (for example the frequency index of all infinitive verbs found in a section). The workflow is unidirectional, and it must be run again in full if any annotation needs to be corrected. To add or correct annotations, the user has to edit the sources or the annotations outside of TXM. For example, word properties can be exported from the XML-TEI TXM representation to a tab-separated file, edited in a spreadsheet and injected back into the texts before re-import into TXM.
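The external round trip described above (export word properties, edit them in a table, inject them back) can be sketched as follows. This is only an illustration under stated assumptions: the sample document is a deliberately simplified stand-in in which each word is a w element carrying its properties as plain attributes, which is not the actual XML-TEI TXM schema, and the function names export_properties and inject_properties are hypothetical, not part of TXM.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for a pivot file (assumption):
# the real XML-TEI TXM representation is richer than this.
PIVOT = """<text>
  <s><w id="w_1" pos="NOM">vers.</w> <w id="w_2" pos="NUM">12</w></s>
</text>"""


def export_properties(root, prop):
    """Write one (id, form, property) row per word to a tab-separated string."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "form", prop])
    for w in root.iter("w"):
        writer.writerow([w.get("id"), w.text, w.get(prop, "")])
    return buf.getvalue()


def inject_properties(root, table, prop):
    """Copy edited property values back onto the words; return the change count."""
    rows = csv.DictReader(io.StringIO(table), delimiter="\t")
    edited = {row["id"]: row[prop] for row in rows}
    changed = 0
    for w in root.iter("w"):
        new_value = edited.get(w.get("id"))
        if new_value is not None and new_value != w.get(prop):
            w.set(prop, new_value)
            changed += 1
    return changed


root = ET.fromstring(PIVOT)
table = export_properties(root, "pos")
# Stand-in for the manual correction made in a spreadsheet.
table = table.replace("w_1\tvers.\tNOM", "w_1\tvers.\tn-ab")
changed = inject_properties(root, table, "pos")
```

Keying the table on word identifiers rather than on row order means the spreadsheet can be sorted or filtered during editing without breaking the injection step.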
This paper introduces new services developed in TXM to annotate texts directly from within the results views of specific tools, for a better integration of philological and analytical work. Indeed, results views are good places to notice annotation errors or annotation needs, and to reach what needs to be corrected or annotated.
The three new annotation services concern both adding and correcting information, and all the annotations edited are meant for further exploitation by usual TXM tools.
The first service, developed in partnership with the LARHRA research laboratory in history
http://larhra.ish-lyon.cnrs.fr
As an illustration, figure 1 shows the annotation of the “Faculté de droit d’Aix” entity (of id CoAc13562) in unverified OCRed texts of the “Bulletin administratif de l'Instruction publique” corpus.
Figure 1. TXM screenshot of a Concordance of a “Faculté de droit d’Aix” word sequence pattern to annotate (top) and of browsing SyMoGIH semantic categories to find the CoAc13562 identifier to use for the annotation (bottom).
TXM internal management of those annotations is equivalent to a re-import of the current pivot representation with the new annotations. After re-import (after saving annotations) the new annotations are available for all TXM tools to work on like any original “annotation” of the texts (with internal textual structures and their properties).
The second service is based on the annotation of words of concordance pivots: a word present in the pivots
As an illustration, figure 2 shows the correction of the “pos” property of some “vers.” words used in biblical references in works by Hobbes lemmatized by Morphadorner (Burns, 2013).
Figure 2. TXM screenshot of a Concordance to set the “pos” property to the “n-ab” value of two occurrences of the “vers.” word, selected by their concordance line.
TXM internal management of those annotations is equivalent to a re-import of the current pivot representation with new annotations encoded in XML-TEI TXM at the word level.
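The idea of correcting a word property from selected concordance lines can be illustrated with a small sketch. It assumes a simplified word-level encoding (w elements with a pos attribute) rather than the real XML-TEI TXM schema; the functions concordance and annotate_pivots are hypothetical names, and the tags other than “n-ab” are illustrative values only.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical word-level encoding (assumption, not the TXM schema).
SAMPLE = """<text>
  <w id="w_1" pos="n">chap.</w><w id="w_2" pos="crd">13</w>
  <w id="w_3" pos="vvz">vers.</w><w id="w_4" pos="crd">4</w>
  <w id="w_5" pos="n-ab">vers.</w><w id="w_6" pos="crd">9</w>
  <w id="w_7" pos="vvz">vers.</w><w id="w_8" pos="crd">2</w>
</text>"""


def concordance(words, form, width=1):
    """Return (pivot element, left context, right context) for each match."""
    lines = []
    for i, w in enumerate(words):
        if w.text == form:
            left = " ".join(x.text for x in words[max(0, i - width):i])
            right = " ".join(x.text for x in words[i + 1:i + 1 + width])
            lines.append((w, left, right))
    return lines


def annotate_pivots(lines, selected, prop, value):
    """Set a property on the pivot word of each selected concordance line."""
    for n in selected:
        lines[n][0].set(prop, value)


root = ET.fromstring(SAMPLE)
words = list(root.iter("w"))
lines = concordance(words, "vers.")
# The user selects the first and third lines, whose pivots are mis-tagged.
annotate_pivots(lines, selected=[0, 2], prop="pos", value="n-ab")
pos_values = [w.get("pos") for w, _, _ in lines]
```

Because each concordance line keeps a reference to its pivot element, the correction is written straight into the word-level encoding, and any tool reading that encoding afterwards sees the new value.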
The third annotation service is based on manual annotation of sequences of words inside text editions with elements of a Unit-Relation-Schema (URS) annotation model (Widlöcher & Mathet, 2009). URS type annotations are designed to encode complex discourse entities like co-reference chains in texts (Schnedecker et al., 2017).
As an illustration, figure 3 shows the annotation of the “ses loix” sequence of words with a URS unit of type MENTION, with its grammatical category set to the value “GN.POS” and its referent set to the value “les lois de la divinité”, in the first chapter of the 1755 edition of De l'esprit des lois by Montesquieu.
Figure 3. TXM screenshot displaying the first page of an edition of De l'esprit des lois highlighting in light yellow all URS units of type MENTION and in bold the unit currently selected (top window), and displaying the current unit properties value input form (bottom window): CATEGORY property at value “GN.POS”...
TXM import/export management services represent those annotations as XML-TEI stand-off annotations anchored to the word elements of the XML-TEI TXM representation of texts (Grobol et al., 2018).
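The principle of stand-off annotation anchored to word elements can be sketched as below. The sketch loosely borrows TEI element names (span, fs, f, string) but is not a validated TEI structure and does not reproduce the serialization actually used by TXM; the helper names add_unit and resolve, the unit identifier and the REF property name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical text layer: words carry identifiers only.
TEXT = '<text><w id="w_1">ses</w> <w id="w_2">loix</w> <w id="w_3">sont</w></text>'


def add_unit(standoff, unit_id, unit_type, word_ids, properties):
    """Append a unit that points to its first and last word by identifier."""
    span = ET.SubElement(standoff, "span", {
        "id": unit_id,
        "type": unit_type,
        "from": "#" + word_ids[0],
        "to": "#" + word_ids[-1],
    })
    fs = ET.SubElement(span, "fs")
    for name, value in properties.items():
        f = ET.SubElement(fs, "f", {"name": name})
        ET.SubElement(f, "string").text = value
    return span


def resolve(text_root, span):
    """Follow the anchors of a unit back to the word forms it covers."""
    words = list(text_root.iter("w"))
    ids = [w.get("id") for w in words]
    start = ids.index(span.get("from").lstrip("#"))
    end = ids.index(span.get("to").lstrip("#"))
    return [w.text for w in words[start:end + 1]]


text_root = ET.fromstring(TEXT)
standoff = ET.Element("standOff")
unit = add_unit(
    standoff, "u-MENTION-1", "MENTION", ["w_1", "w_2"],
    {"CATEGORY": "GN.POS", "REF": "les lois de la divinité"},
)
covered = resolve(text_root, unit)
```

The text layer is left untouched: the unit lives in a separate tree and refers to words only through their identifiers, which is what allows several annotation layers to coexist over the same words.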
By using a common XML-TEI TXM pivot representation for the internal management of corpora across all the annotation services, TXM unifies transcription, encoding and annotation activities in a single framework. In this framework, annotations can represent manual (user), semi-automatic (machine+user) or automatic (machine) interpretation results that are used further for analysis and interpretation work. The reflexive nature of the resulting text analysis workflow is schematized in figure 4. Texts are first digitized by OCR, transcribed or converted from digital formats. They are then possibly philologically corrected and established through XML-TEI manual encoding. They are then automatically processed by NLP tools while being imported into TXM to produce the TXM internal corpus model. Corpus analysis is then assisted by TXM tools applied to the corpus model. The pivot representation that gathers all annotations produced by annotation tools is shown as the node labeled “Pivot rep.”, and the interpretation workflow itself is shown as a digital hermeneutic circle.
Figure 4. Digital hermeneutic circle integration into TXM.
All the new annotation services integrated into TXM contribute to building a comprehensive annotation-based platform for the analysis of digital text corpora. From an epistemological point of view, integrating the different annotation models and tools into the platform through TEI helps its users to better define and trace what comes from the source corpus they analyze and what comes from their own or others' interpretation work.
This work was funded by the ANR and the DFG under grant numbers ANR-15-CE38-0008 (DEMOCRAT project) and ANR-14-FRAL-0006 (PaLaFra project).