Interactive Similarity Analysis of Early New High German Text Variants

André Medek, Martin Luther University Halle-Wittenberg, Germany, andre.medek@informatik.uni-halle.de
Jörg Ritter, Martin Luther University Halle-Wittenberg, Germany, joerg.ritter@informatik.uni-halle.de
Paul Molitor, Martin Luther University Halle-Wittenberg, Germany, paul.molitor@informatik.uni-halle.de
Sylwia Kösser, Martin Luther University Halle-Wittenberg, Germany, sylwia.koesser@germanistik.uni-halle.de


Long Paper. Keywords: genetic edition, manuscript, text comparison, visualization, Early New High German. Topics: corpora and corpus activities; encoding: theory and practice; lexicography; software design and development; text analysis; programming; German studies; visualisation; linking and annotation. Language: English.

In this paper we present an intuitive and interactive visualisation of comparison results for Early New High German manuscript records that supports the scholar in determining clusters of similar records. Based on detailed comparison results generated by the framework LAKOMP, an easily interpretable similarity graph is dynamically generated. By selecting the text passages and manuscript records of interest, the scholar can explore commonalities and divergences through the witnesses linked to the graph elements.

* * *

Philologists studying the genetic origin of texts that survive in multiple records face the challenge of identifying genetically close records. To this end, the witnesses for similarities or divergences have to be listed for every pair of records. Done manually, such a pair-wise comparison of records is a tedious task that would benefit greatly from automatic tool support. Here we present such tool support for Early New High German manuscripts. Searching for similarities and divergences between manuscripts from this stage of the German language is particularly challenging because of the lack of a common orthography. Our tool implements a semiautomatic multistage approach that leads to an interactive similarity graph representing the similarities of records in an intuitive and transparent way. This graph gives an overall view of the relationships between the different records. Together with the possibility to ‘jump’ to the witnesses for similarities, it allows philologists to confirm or refute previously formulated hypotheses about the clustering of records. The similarity graph is based on witnesses for similarities of records identified by our framework LAKOMP (Medek et al., 2014; 2015), which is being developed in the interdisciplinary project Semi-automatic Difference Analysis of Complex Text Variants (SaDA), in which Germanists, Romanists, and computer scientists develop software tools to support philological text comparison projects. Existing tools for comparing multiple text records, such as Juxta (Juxta, 2014) or CollateX (Dekker and Middell, 2011), also offer visualisations of their comparison results, but they do not scale well to large amounts of text and do not make use of annotation data such as lemmata, which is crucial for comparing Early New High German manuscripts.

We present our approach using the example of the 15th-century manuscript ‘Wundarznei’ by Heinrich von Pfalzpaint, a forefather of plastic surgery and one of the most famous surgeons of the late Middle Ages. While the original manuscript has been lost, 11 manuscript records written by different copyists in the 15th and 16th centuries are known. One of these records is presumed lost; the other 10 are available to scholarship and are currently subject to a detailed comparison by Germanists at the Martin Luther University Halle-Wittenberg. Figure 1 shows a facsimile of a page of one of these records. Besides a detailed listing of the differences between corresponding text passages of the records in the form of a critical apparatus, one goal of the text comparison is the grouping of the records according to their similarities.

Figure 1. A facsimile of the manuscript Dresden 292 of the ‘Wundarznei’. Source: SLUB Dresden/Handschriftensammlung, Signature: [Mscr.Dresd.C292, fol. 151v].

Approach

A prerequisite for visualising similarities is a detailed comparison of the records. For an automatic comparison of the manuscripts of the ‘Wundarznei’, a particular challenge arises from the lack of a common orthography in Early New High German. Because the same word form may be spelled very differently even within a single record, an automatic identification of word forms is not possible. A well-founded identification de facto requires manual intervention by the philological researchers for nearly every word in the manuscript records. We address this problem by comparing the records on the basis of the lemmata of the word forms instead of the word forms themselves. An appropriate semiautomatic tool for the lemmatisation of Early New High German manuscripts is integrated in our framework LAKOMP (Medek et al., 2014). It offers the philologist an intuitive, user-friendly, and very fast interface for entering lemmata for all word forms of a text. 1 Currently, it is applied in multiple philological projects.
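To illustrate the principle, the following minimal Python sketch treats two annotated tokens as corresponding whenever their lemmata agree, even though their surface spellings differ. This is only an illustration of the idea, not LAKOMP's data model; the token structure and the sample spellings are hypothetical.

# Minimal sketch: compare annotated tokens by lemma, not by surface form.
def corresponds(token_a, token_b):
    """Treat two annotated tokens as corresponding if their lemmata agree."""
    return token_a["lemma"] == token_b["lemma"]

# Hypothetical excerpts from two records: spellings differ, lemmata agree.
record_h = [{"form": "artzney", "lemma": "arznei"}, {"form": "vnnd", "lemma": "und"}]
record_d = [{"form": "ertzenei", "lemma": "arznei"}, {"form": "und", "lemma": "und"}]

for a, b in zip(record_h, record_d):
    print(a["form"], "~", b["form"], "->", corresponds(a, b))  # prints True twice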

After the lemmatisation of the manuscripts, the comparison is done in two steps, covering two levels of detail. In the first step, corresponding text passages among the records are identified. For this purpose, the records are split into small segments, e.g., sentences or parts of sentences. These segments are aligned in a tabular manner such that corresponding segments of different records are placed next to each other. In this table, each column contains the consecutive segments of one record, identified by its siglum. Figure 2 shows an excerpt of an alignment resulting from this first comparison step, computed in LAKOMP with an alignment algorithm that is based on fingerprints of the segments and was developed within our project. 2

Figure 2. Alignment of corresponding text passages. Here segmentation and alignment can be modified. Words marked in dark grey lack annotation data such as lemma or morphological data. A click on a word opens the annotation dialogue.
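The alignment algorithm itself is not described in detail here. The following Python sketch merely illustrates one possible fingerprint-based scheme under stated assumptions: a segment's fingerprint is taken to be the set of its lemmata, and segments of two records are paired greedily by the Jaccard similarity of these sets. Both choices are illustrative assumptions, not the algorithm developed in the project.

# Sketch of fingerprint-based segment matching between two records.
# Assumptions: fingerprint = set of lemmata, greedy best-match pairing.
def fingerprint(segment):
    """Reduce a segment (a list of annotated tokens) to the set of its lemmata."""
    return frozenset(token["lemma"] for token in segment)

def jaccard(fp_a, fp_b):
    """Similarity of two fingerprints as the Jaccard index of their lemma sets."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0

def align_segments(record_a, record_b, threshold=0.5):
    """Pair each segment of record_a with its most similar, not yet used
    segment of record_b; segments below the threshold remain unaligned."""
    fps_b = [fingerprint(seg) for seg in record_b]
    used, alignment = set(), []
    for i, seg_a in enumerate(record_a):
        fp_a = fingerprint(seg_a)
        best_j, best_sim = None, threshold
        for j, fp_b in enumerate(fps_b):
            sim = jaccard(fp_a, fp_b)
            if j not in used and sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            used.add(best_j)
            alignment.append((i, best_j))  # indices of corresponding segments
    return alignment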

In the second step, the corresponding segments are compared in detail at the word level, resulting in a fine-grained alignment as shown in Figure 3. Here, each line contains a segment (as defined above) of one record, so that corresponding words of different records are placed in the same column. Every column of such a word-level alignment in which two records both have an entry is called a witness for the similarity of these two records. By counting the witnesses for each pair, the scholar obtains an objective measure of the genetic proximity of records: pairs of records with many witnesses are genetically closer to each other than pairs with only a few.

This information can be displayed graphically. An intuitive presentation is a similarity graph, in which each record is represented by a node and the similarity between two records by an edge between their nodes. For each edge, a weight is determined from the witnesses, e.g., by simply counting them or by categorising the witnesses and prioritising the counts per category. Given annotation data such as part-of-speech tags, witnesses of word type noun can, for instance, be weighted more heavily than those of type conjunction. Highly weighted edges are drawn thicker and shorter than others, placing the corresponding nodes closer together. Figure 4 shows a similarity graph generated by LAKOMP for a large portion of the ‘Wundarznei’. As it is crucial for scholars to understand the relation between a visualisation and the underlying data, LAKOMP allows interactive exploration and manipulation of the graph in a web browser. Current technology for dynamic web content, provided by jQuery (jQuery, 2014), and for visualisation, provided by D3.js (D3, 2014), allows sophisticated, responsive renderings. Hovering over an edge displays the number of witnesses contributing to it; clicking on an edge presents a list of all witnesses for that edge and allows the user to jump directly to the word-level alignments that contain them. Additionally, scholars can select the text passages and the records for which the similarity graph should be drawn.

Figure 3. Detailed comparison of records.
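The step from word-level alignments to edge weights can be made concrete with a small Python sketch. The column representation used here (a mapping from sigla to annotated tokens) and the category weights are illustrative assumptions, not LAKOMP's internal format.

# Sketch: derive similarity-graph edge weights from word-level alignment columns.
from collections import Counter
from itertools import combinations

CATEGORY_WEIGHT = {"NOUN": 2.0, "VERB": 1.5, "CONJ": 0.5}  # hypothetical weights

def edge_weights(columns, pos_weighted=False):
    """columns: list of dicts mapping a record's siglum to its annotated token
    in that column (records without an entry are simply absent from the dict).
    Returns a Counter keyed by record pairs, holding witness counts or weights."""
    weights = Counter()
    for column in columns:
        for a, b in combinations(sorted(column), 2):  # each co-present pair = one witness
            if pos_weighted:
                weights[(a, b)] += CATEGORY_WEIGHT.get(column[a].get("pos"), 1.0)
            else:
                weights[(a, b)] += 1
    return weights

# Example: a single column in which the records H and D3 both have an entry.
columns = [{"H": {"lemma": "arznei", "pos": "NOUN"},
            "D3": {"lemma": "arznei", "pos": "NOUN"}}]
print(edge_weights(columns))  # Counter({('D3', 'H'): 1})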

Preliminary Results

In the case of the ‘Wundarznei’, the Germanists are furthermore interested in finding groups of records such that closely related records share a group. Until now, the existence of two groups has been assumed: group A containing the records with sigla H, Bu, D3, E, and Br, and group B containing the records with sigla B8, B9, D2, P, and St. The similarity graph shown in Figure 4, generated for a section of the text over all 10 available records, indeed shows two clusters. However, each cluster is only a subset of one of the assumed groups, and the three records St, E, and P cannot be assigned to exactly one of these groups with certainty, either because of shared similarities with multiple groups or because of missing text in the considered section. The generated similarity graphs are fundamental for discussions of the assumed groups, providing an objective measure for supporting or rejecting assumptions on the basis of the clusters found.

Funding

This work was funded by the German Federal Ministry of Education and Research (BMBF) [grant number 01UG1247] as part of the project Semi-automatische Differenzanalyse von komplexen Textvarianten under the direction of Professor Dr. Thomas Bremer, Professor Dr. Paul Molitor, Dr. Jörg Ritter, and Professor Dr. Hans-Joachim Solms.

Figure 4. A similarity graph with edges linked to witnesses.

Notes

1. LAKOMP also supports annotation with part-of-speech tags and morphological data, which can likewise be used for comparing the records.

2. Alignment algorithms from statistical machine translation, such as those implemented in GIZA++ or the Berkeley Aligner (Liang et al., 2006; Denero, 2007), compute word-level alignments of two texts but require a large training corpus providing pairs of corresponding sentences among the text variants. We did not use them because they are limited to aligning two text variants at a time, which leaves the problem of generalising to more text variants unsolved. Another issue is their expensive training phase, which requires manually created training data for each pair of text variants.

Bibliography
Dekker, R. H. and Middell, G. (2011). Computer-Supported Collation with CollateX: Managing Textual Variance in an Environment with Varying Requirements. In Maegaard (ed.), Supporting Digital Humanities, Conference Proceedings, Copenhagen, 17–18 November 2011.
Denero, J. (2007). Tailoring Word Alignments to Syntactic Machine Translation. Proceedings of the 45th Annual Meeting of the ACL.
D3. (2014). D3—Data-Driven Documents. http://d3js.org.
jQuery. (2014). jQuery—Write Less, Do More. http://jquery.com.
Juxta. (2014). Juxta—Compare—Collate—Discover. http://www.juxtasoftware.org.
Liang, P., Taskar, B. and Klein, D. (2006). Alignment by Agreement. Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pp. 104–11.
Medek, A., Pöckelmann, M., Bremer, Th., Solms, H. J., Molitor, P. and Ritter, J. (2015). Differenzanalyse komplexer Textvarianten: Diskussion und Werkzeuge. Datenbank-Spektrum, Themenheft ‘Informationsmanagement für Digital Humanities’, 15(1) (March): 25–31, http://link.springer.com/article/10.1007%2Fs13222-014-0173-y.
Medek, A., Ritter, J., Molitor, P., Kösser, S. and Leipold, A. (2014). User-Friendly Lemmatization and Morphological Annotation of Early New High German Manuscripts. Digital Humanities, DH2014, Lausanne, 7–12 July 2014.
SaDA. (2014). Martin Luther University Halle-Wittenberg: SaDA—Semi-automatische Differenzanalyse von komplexen Textvarianten. http://www.informatik.uni-halle.de/sada.