In this paper, we describe computer-aided authorship testing on the Middle High German (MHG) text Apollonius von Tyrland, written by Heinrich von Neustadt (HvN) in the late 13th century. The text is based on a Latin original, but HvN is suspected to have incorporated other sources into his translation. We investigate assumptions regarding a segmentation of this text into parts that supposedly trace back to different sources. Our objective is to clarify a) the validity of this segmentation and b) which features reveal the difference in origin of the segments. In particular, we distinguish between features related to content and features related to style.
Comparing frequency distributions over frequent words has been established as a state-of-the-art method for contrasting style across different literary texts (cf. Eder et al., 2013). Quite recently, Herrmann et al. (2015) proposed to define style as a property constituted by “formal features which can be observed quantitatively or qualitatively” (p. 44). An important aspect of this definition is that style has to be based on observable features.
We propose to make a clear cut between content and style. To measure stylistic differences, we restrict the selection to words that appear in every text of the corpus and are thus observable in each text, assuming that this is a simple way to exclude words that are markers of content. Content words, which presumably appear only in a subset of the texts, do not contribute to this understanding of style. They are instead extracted by filtering the most frequent words (MFW) with a stop word list containing all the function words of the language. We refer to the two sets of feature words extracted for a text as content words and style words.
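The distinction between the two feature sets can be illustrated with a minimal Python sketch. It is not the procedure used with stylo; the function name `split_feature_words`, the parameter `n_mfw`, and the idea of drawing both sets from the same MFW list are assumptions made for illustration only.

```python
from collections import Counter


def split_feature_words(texts, function_words, n_mfw=100):
    """Derive style words and content words from a tokenised corpus.

    texts          -- dict mapping a text name to its list of tokens
    function_words -- set of function words (the stop word list)
    n_mfw          -- hypothetical cut-off for the most frequent words
    """
    # Count every token over the whole corpus to obtain the MFW list.
    corpus_counts = Counter()
    for tokens in texts.values():
        corpus_counts.update(tokens)
    mfw = [word for word, _ in corpus_counts.most_common(n_mfw)]

    # Style words: frequent words that are observable in every text.
    vocabularies = [set(tokens) for tokens in texts.values()]
    style_words = [w for w in mfw if all(w in v for v in vocabularies)]

    # Content words: frequent words left after removing function words.
    content_words = [w for w in mfw if w not in function_words]
    return style_words, content_words
```

Note that the two sets are not guaranteed to be disjoint under this reading: a non-function word that happens to occur in every text would appear in both.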
To validate this idea, we analyse five MHG texts by three authors with the R package stylo (Eder, 2013). Figure 1 shows the results for the content words (a) and the style words (b). The higher similarity of Erec and Tristan in (a), compared to Der arme Heinrich, reflects that both narratives feature knighthood as a main theme. In contrast, the narrative in Der arme Heinrich involves more religious themes (faith, god), which is also reflected in the frequency tables. This distinction is clearly based on content. If we focus our analysis on style words, as in (b), the texts cluster according to their authors. Thus, dividing the frequent words of a corpus into style words and content words can give us better insights into the results.
Bockhoff and Singer (1911) formulated two hypotheses regarding the origin of the text, which is ca. 21,000 verses long: one concerning its overall structure (Table 1) and one concerning the internal structure of one segment (Table 2). To get an impression of which parts can indeed be found as a distinctive group using content words and style words respectively, we split the text into 71 segments of equal length. These segments are then clustered with Stylo, using Delta as the distance measure.
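The segmentation and the distance computation can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the Stylo run: it implements classic Burrows's Delta (mean absolute difference of z-scored relative frequencies), uses the population standard deviation, and the helper names are hypothetical. The Delta variant and settings actually used in Stylo may differ.

```python
from collections import Counter
from statistics import mean, pstdev


def split_into_segments(tokens, n_segments):
    """Split a token list into n_segments parts of (nearly) equal length."""
    size, rest = divmod(len(tokens), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `rest` segments take one extra token each.
        end = start + size + (1 if i < rest else 0)
        segments.append(tokens[start:end])
        start = end
    return segments


def relative_frequencies(segment, features):
    """Relative frequency of each feature word within one segment."""
    counts = Counter(segment)
    total = len(segment) or 1
    return [counts[w] / total for w in features]


def delta_matrix(segments, features):
    """Pairwise Burrows's Delta between segments (assumed variant)."""
    freqs = [relative_frequencies(s, features) for s in segments]
    # Standardise each feature across all segments (z-scores).
    columns = list(zip(*freqs))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    z = [
        [(f - m) / s if s > 0 else 0.0 for f, m, s in zip(row, means, stds)]
        for row in freqs
    ]
    n = len(segments)
    # Delta is the mean absolute difference of the z-scores.
    return [
        [mean(abs(a - b) for a, b in zip(z[i], z[j])) for j in range(n)]
        for i in range(n)
    ]
```

Calling `split_into_segments(tokens, 71)` and then `delta_matrix(segments, style_words)` or `delta_matrix(segments, content_words)` would yield one distance matrix per feature set, which a clustering step could then consume.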
Our baseline consists of randomly assigning distances between the segments, drawn from a uniform distribution. We sample baseline results 1000 times. The baseline results give an impression of how well an uninformed method overlaps with the hypothesis classification introduced in Section 3. Comparing them to the results of our methods shows which method is in line with the hypothesis, as opposed to a random classification.
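A minimal sketch of such a baseline is given below. Several points are assumptions: the distances are drawn from the unit interval, the F-score is taken to be the balanced F1 over sets of segment indices, and the step that turns a distance matrix into a score is left to a caller-supplied, hypothetical `score_fn`, because the text does not spell out that step.

```python
import random


def random_distance_matrix(n, rng):
    """Symmetric n x n matrix with uniform random off-diagonal distances."""
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = rng.random()
    return d


def f_score(predicted, gold):
    """Balanced F1 between two sets of segment indices (assumed measure)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def sample_baseline(n_segments, score_fn, n_samples=1000, seed=0):
    """Average score of an uninformed method over random distance matrices.

    score_fn is a hypothetical callable that maps a distance matrix to a
    score, e.g. by clustering it and comparing the result to one part of
    the hypothesis with f_score.
    """
    rng = random.Random(seed)
    scores = [
        score_fn(random_distance_matrix(n_segments, rng))
        for _ in range(n_samples)
    ]
    return sum(scores) / len(scores)
```

Fixing the seed keeps the sampled baseline reproducible, so that the scores of the content-word and style-word clusterings can be compared against the same reference value.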
Regarding the first hypothesis (Table 3), we observe that for both feature sets the F-score lies above the baseline for all parts except the third. This seems reasonable, since this part is suspected to be based on different sources and might therefore be more heterogeneous both in content and in style. Style seems to be more homogeneous (F-score above baseline) throughout the entire text, whereas content seems to be heterogeneous, especially in the adventure part introduced by HvN (F-score below baseline). This is in line with the hypothesis, considering that HvN’s insertions report on different adventures.
Analysing these heterogeneous parts further (second hypothesis, Table 4), we see heterogeneity in terms of content for all but one part, The duel in Syria. The duel in Syria seems homogeneous in style, whereas The adventures in Galacites shows tendencies towards a heterogeneous style.
Both feature sets show similar tendencies and support a major part of the hypotheses by Bockhoff and Singer (1911) regarding the parts suspected to be insertions. Nevertheless, differences in content cannot clearly confirm the suspicion that HvN incorporated other sources. He might have created the additional adventures himself. Bockhoff and Singer (1911) do not cite sources from which HvN copied narratives, which makes this question difficult to settle. Overall, differences in style are much less pronounced than differences in content, which is in line with the hypothesis.