The proposed paper reports on philological and computational aspects of building and annotating a literary corpus. As part of the ongoing corpus-stylistic project Q-LIMO (Quantitative Analysis of Literary Modernity), the KARREK (Kafka/Referenzkorpus), a corpus centered on German narrative literature, is designed to enable comparative quantitative-stylistic analyses of Franz Kafka’s prose. The corpus includes literary as well as non-literary texts, is tagged for part of speech (POS), and is enriched with selected types of metadata (e.g., author, title, date of publication, and [narrative] genre). The number of digital German literary text collections has been growing, and there are several repositories, such as the TextGrid Repository, the German Text Archive (DTA), Gutenberg-DE, and individual digital collections (e.g., the Central Catalogue of Digitized Prints [zvdd]). However, no digital resource has so far been specifically tailored to the needs of quantitative Kafka research. We offer such a resource: KARREK covers the entirety of Kafka’s writings (literary and non-literary) as well as a meaningful selection of Newer German Literature, with a focus on Modernism (ca. 1880–1930). Moreover, it provides consistent, high-quality linguistic annotation, text markup, and metadata. The corpus is hence a unique resource; it will be made publicly available.
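To illustrate the kind of record the metadata enrichment implies, the following minimal sketch models one corpus text with the metadata types named above. The field names and types are illustrative assumptions, not KARREK’s actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CorpusText:
    """One corpus text with its metadata -- field names are hypothetical."""
    author: str              # e.g., "Franz Kafka"
    title: str               # e.g., "Die Verwandlung"
    pub_year: Optional[int]  # date of (first) publication, if known
    genre: str               # (narrative) genre, e.g., "novella"
    is_literary: bool        # literary vs. non-literary text
    edition: str             # literary edition the transcription follows
```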
Building and annotating the KARREK poses challenges typical of textual analysis in the Digital Humanities. The first main task is a philological one: selecting texts that allow meaningful, hypothesis-driven analyses of Franz Kafka’s writings. Philological standards of corpus construction are especially high in terms of editorial detail and the consideration of cultural, societal, and philosophical contexts (cf. Engel and Auerochs, 2010) as well as linguistic ones (cf. Blahak, 2015). KARREK is hence balanced for factors such as canonicity, popularity, (narrative) genre, and linguistic variety, and it aims for transparency regarding the literary editions used. A special feature of the corpus is its representation of Kafka’s reading habits: we assume that a writer’s texts may reflect a dimension of ‘constructive reading’ (‘produktive Rezeption’, cf. Grimm, 1977), resulting in stylistic similarity to particular authors. For corpus compilation, we therefore systematically review the available sources to determine which literary and non-literary works are candidates for having influenced Kafka’s writings (e.g., Blank, 2001; Born, 1990; Born et al., 1983); the ensuing data analysis will explore Kafka’s position within a global network of texts and authors.
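To make the notion of balancing concrete, a simple diagnostic of the kind one might run during compilation is sketched below: it computes the share of tokens per stratification factor (here: genre). The function, categories, and counts are illustrative assumptions, not KARREK’s actual compilation tooling.

```python
from collections import Counter

def genre_proportions(texts):
    """Share of tokens per genre -- a simple corpus-balance diagnostic.

    `texts` is an iterable of (genre, token_count) pairs; genres and
    counts below are purely illustrative.
    """
    counts = Counter()
    for genre, n_tokens in texts:
        counts[genre] += n_tokens
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}

# Hypothetical toy data:
sample = [("novella", 35_000), ("novel", 120_000), ("short prose", 18_000)]
print(genre_proportions(sample))
```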
Our aim of forming a reference corpus (with some degree of ‘representativeness’ for Newer German Literature, Prague Literature, Modern Literature, etc.) requires transforming a substantial amount of ‘big literary data’ into a consistent corpus. The philological questions translate into computational tasks: retro-digitization (including OCR), preprocessing of the data, the creation of consistent metadata (on metadata, see, e.g., Alemu and Stevens, 2015; Lazinger, 2001), and conversion into a uniform textual format with standardized markup (TEI P5). The entire corpus will be published in a flexible stand-off format (TCF), which allows for further annotation.
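The stand-off principle behind TCF-style publication is that annotation layers are kept separate from the token layer and linked to it via token IDs, so new layers can be added without touching existing ones. The following sketch illustrates that principle with a simplified JSON stand-in; it mimics the layered design but is not the actual TCF XML schema.

```python
import json

def to_standoff(tagged_tokens):
    """Turn [(word, pos), ...] into separate, ID-linked token and POS layers."""
    tokens = [{"id": f"t{i}", "form": word}
              for i, (word, _) in enumerate(tagged_tokens, start=1)]
    pos = [{"tokenID": f"t{i}", "tag": tag}
           for i, (_, tag) in enumerate(tagged_tokens, start=1)]
    return {"tokens": tokens, "pos": pos}

# Toy example with STTS tags (opening words of Kafka's "Der Process"):
layers = to_standoff([("Jemand", "PIS"), ("musste", "VMFIN"),
                      ("Josef", "NE"), ("K.", "NE")])
print(json.dumps(layers, ensure_ascii=False, indent=2))
```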
The second main task is the reliable and accurate linguistic annotation of POS (using the STTS tag set for German). Word class is a reliable indicator of register and genre variation (e.g., Biber and Conrad, 2009) and can be annotated automatically with high accuracy. Although high-quality POS taggers are available for German (e.g., unigram, HMM, and perceptron taggers), most were trained on late twentieth-century newspaper texts. To ensure accurate tagging of older and literary texts, we train our taggers primarily on the POS-tagged corpus of the German Text Archive (DTA), which comprises historical and literary data. In addition, a round of manual error correction (on a sample of ca. 40,000 words) and the training of a CRF tagger (MarMoT) are carried out to increase the POS-tagging accuracy for our corpus. We are currently establishing a workflow comprising an eXist database with an annotation interface, as well as a procedure for evaluating tagger output on different sections of the corpus. Beyond the compilation of the corpus, our project will thus offer new insights into the POS tagging of historical German narrative literary texts.
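The project’s actual setup uses MarMoT, a CRF tagger; as a self-contained illustration of the underlying train-and-evaluate loop, here is a minimal sketch using NLTK’s simple n-gram taggers instead. `train_sents` and `gold_sents` are placeholders for STTS-annotated sentence lists (e.g., drawn from the DTA and from a manually corrected sample).

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

def train_backoff_tagger(train_sents):
    """Train a simple backoff chain on sentences of (word, stts_tag) pairs."""
    t0 = DefaultTagger("NN")                      # last resort: common noun
    t1 = UnigramTagger(train_sents, backoff=t0)   # per-word majority tag
    return BigramTagger(train_sents, backoff=t1)  # adds one tag of left context

def pos_accuracy(tagger, gold_sents):
    """Token-level accuracy of `tagger` against gold-annotated sentences."""
    correct = total = 0
    for sent in gold_sents:
        predicted = tagger.tag([word for word, _ in sent])
        for (_, pred_tag), (_, gold_tag) in zip(predicted, sent):
            correct += pred_tag == gold_tag
            total += 1
    return correct / total
```

Evaluating on held-out gold sections, as in `pos_accuracy`, is what allows reporting tagging accuracy separately for different parts of the corpus.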
In DH, tackling representativeness in literary corpus building is an important research task, with theoretical background provided by corpus and computational linguists (e.g., Biber, 1994; Evert, 2006; Lee, 2001). Although methods of ‘big data’ computing (cf. Anderson, 2008; Manovich, 2015) are extremely promising for digital literary studies, we suggest that creating balanced and representative corpora remains a vital question in our field. Since there are only a few best-practice examples (for German, e.g., the DTA), and representativeness is a recurring issue that can by no means be considered solved once and for all, the way our particular study tackles it may be of interest to many similar studies. In the talk, we will thus address ‘representativeness’, reporting on our text-selection decisions (authors, titles, genre, canonicity/popularity, ‘constructive reading’). We will also describe the linguistic (POS) annotation, including details of our machine-learning procedure and the user interface. In addition, we will present the workflow and show how it can be applied to analogous studies and projects, and we will give the details of the online publication. In all, our research shows that building and enriching a specialized literary corpus is far from trivial: it requires sound theoretical modeling of the phenomenon under construction and an interdisciplinary methodology that does justice to philological as well as computational criteria of high-quality corpus research.