Readers of literary texts have an impression of their complexity, and professional readers such as literary critics often use, if not the term, then the concept as a basis for their evaluation of literary texts. Recent attempts in literary studies to explore the notion had difficulties agreeing on a definition (Koschorke 2017), a fate we try to avoid by relying on a very general understanding of complexity: the number of elements and the number and quality of their relations. When we are talking about texts, these elements can be words, syllables, metaphors, intertextual relations, the syntax of sentences, themes and topics, etc. In this perspective, text complexity becomes a multi-dimensional phenomenon. We should emphasize that we are talking about a set of textual features and not about the difficulty a reader has in processing the text, which is the domain of cognitive reader studies. Our attempt in this paper to describe a useful approach to thematic complexity in fiction is part of our ongoing research on the complexity of fiction. In Jannidis et al. (2019) we looked at measures of vocabulary richness and syntactic complexity and applied them to the same collection of popular genre novels and the same collection of literature by renowned German writers that we analyze in this paper. Surprisingly, with the exception of sentence length, no single measure allows us to distinguish consistently between these two collections. This study now turns to thematic complexity. Obviously there is no limit to the themes and topics a novel can deal with, but an unbounded quantity is difficult to measure. We therefore use the mixture of genres in documents as a proxy for thematic complexity, and we measure this mixture using topic modeling and Zeta to describe the genres.
The analysis is based on the digital texts delivered to the German National Library (DNB). The DNB collects, archives, catalogues and makes available to the public the media works published in Germany since 1913, as well as German-language media works published abroad. Since 2006, collecting media works published online has also been one of the DNB's tasks.
The DNB's holdings currently comprise more than 5 million digital objects, including some 900,000 e-books, around 1.5 million e-journal issues and about 2 million e-paper editions. In addition to the extensive physical inventory, DNB users have a growing pool of born-digital objects at their disposal.
The cooperation with the DH communities is a strategically important extension of the DNB's range of services. One aspect is the support of selected cooperation partners by providing even large corpora, primarily of born-digital objects such as e-books, for automatic analyses carried out on site in the DNB labs.
Our corpus comprises 9,000 novels from 13 pulp fiction genres as well as highbrow novels by prize-winning authors (see Fig. 1).
We aim to compute thematic complexity from the mixture of genres in documents. Since genre labels for the documents are provided, we use them to identify word fields covering stereotypical themes, topoi or motifs in genres. For this task, we use Zeta (Schöch et al., 2018) and LDA (Blei et al., 2003). In a second step, the words are used to measure a distribution of genres in each novel.
Zeta was originally used to determine which words are well suited to distinguishing two groups of texts, for example texts by two authors. We, however, are only interested in which words are characteristic of a single group. This is why we compare a homogeneous group of texts from one genre with a heterogeneous group representing the entire corpus.
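For reference, the classic variant of Zeta (sd0 in Schöch et al., 2018) compares the proportions of text segments in which a word occurs; a sketch of the formula, with A the genre group and B the heterogeneous comparison group:

```latex
% Classic Zeta (sd0): difference of segment proportions.
% df(x, A): number of segments of corpus A containing word x;
% n_A, n_B: total number of segments in A and B.
\mathrm{Zeta}(x) = \frac{df(x, A)}{n_A} - \frac{df(x, B)}{n_B}
```

Scores range from -1 (the word occurs only in segments of B) to 1 (only in segments of A).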
The Zeta scores of a word x for each genre are multiplied by the frequency with which the word appears within a document. To represent a document by a single vector, the average scores over all words are computed.
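A minimal sketch of this weighting step in Python, assuming precomputed per-genre Zeta scores (the variable and function names are ours):

```python
import numpy as np
from collections import Counter

def document_vector(tokens, zeta_scores, genres):
    """Represent a document as one averaged Zeta score per genre.

    tokens: list of word tokens in the document
    zeta_scores: dict mapping genre -> {word: Zeta score}
    genres: ordered list of genre labels
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    vec = np.zeros(len(genres))
    for g_idx, genre in enumerate(genres):
        scores = zeta_scores[genre]
        # multiply each word's Zeta score by its frequency,
        # then average over all words in the document
        weighted = sum(scores.get(w, 0.0) * f for w, f in counts.items())
        vec[g_idx] = weighted / total
    return vec
```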
We use the variants of Zeta listed by Schöch et al. (2018); see Table 1.
We use the LDA implementation Mallet (McCallum, 2002) to compute topic models with 50, 100, 150, 200 and 250 topics, running 1,000 iterations over our corpus and using the same stopwords as in the Zeta pipeline. For each topic, a genre distribution is calculated from the genre labels, and this distribution is used to infer the distribution of genres in documents via their share of topics (see Fig. 2). Because every topic contains all words and every document contains all topics to some extent, we only take topics into account whose proportion in a document reaches a threshold of 5%.
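A sketch of this inference step, assuming the document-topic proportions have already been read from Mallet's output (the function and variable names are ours):

```python
import numpy as np

def topic_genre_matrix(doc_topics, doc_genres, n_genres):
    """Per-topic genre distribution, aggregated from document labels.

    doc_topics: (n_docs, n_topics) topic proportions per document
    doc_genres: genre index of each training document
    """
    tg = np.zeros((doc_topics.shape[1], n_genres))
    for d, g in enumerate(doc_genres):
        tg[:, g] += doc_topics[d]
    # normalize each topic's accumulated genre weights to a distribution
    return tg / tg.sum(axis=1, keepdims=True)

def document_genres(doc_topics, topic_genres, threshold=0.05):
    """Genre distribution per document via its share of topics."""
    # ignore topics below the 5% threshold and renormalize
    # (assumes every document has at least one topic above the threshold)
    shares = np.where(doc_topics >= threshold, doc_topics, 0.0)
    shares /= shares.sum(axis=1, keepdims=True)
    # mix the topics' genre distributions according to their shares
    return shares @ topic_genres
```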
Both approaches represent documents as vectors with one dimension per genre. To measure thematic complexity, we use the Gini index (Giovanni Bellù and Liberati 2006). This measure of dispersion is commonly used in economics to quantify income inequality. Its values lie between 0, which indicates that all incomes are equal, or in our context that all genres contribute equally to a text, and 1 (approached for large numbers of elements), which means that only one person has any income, or in our context that only one genre contributes to the document. In this form, however, a higher value would mean lower complexity, so for better readability we use the inverse Gini index: a lower value indicates that mainly one genre's vocabulary is used, a higher value indicates the occurrence of words from different genres.
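A sketch of the complexity score, reading the inverse Gini index as 1 minus the Gini index (our interpretation):

```python
import numpy as np

def gini(x):
    """Gini index of a non-negative vector (e.g. genre proportions)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # standard formula based on the rank-weighted sum of sorted values
    cum = np.sum(np.arange(1, n + 1) * x)
    return 2.0 * cum / (n * x.sum()) - (n + 1.0) / n

def thematic_complexity(genre_vector):
    """Inverse Gini: higher values indicate a more even genre mixture."""
    return 1.0 - gini(genre_vector)
```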
There is no way for us to evaluate thematic complexity scores directly, because creating test data by human annotation would be quite time-intensive and of uncertain success. Instead, we use genre classification as a downstream task with our document representations (see 4.1, 4.2) as input features, in order to at least evaluate the genre distributions in documents on which thematic complexity is based.
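The paper does not fix a particular classifier; as a sketch, any standard classifier over the genre-distribution vectors can serve, for example logistic regression with cross-validation:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_genre_classification(X, y):
    """Downstream evaluation: predict the genre label from the
    14-dimensional genre-distribution vectors of the documents.

    X: (n_docs, 14) document vectors from Zeta or LDA
    y: gold genre labels
    """
    # the classifier and scoring choices here are our assumptions
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=10, scoring="f1_macro").mean()
```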
The evaluation results of the genre classification (see Table 2) show the best results for LDA with 150 topics. The following results, except for the most distinctive words in Table 3, are all based on this setup.
Figure 3 shows the transformation of the document vectors consisting of the 14 genre proportions into a two-dimensional representation. The calculation was done with Uniform Manifold Approximation and Projection (McInnes et al. 2018).
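A minimal sketch of this projection with the umap-learn package:

```python
import umap

def project_2d(X):
    """Project (n_docs, 14) genre vectors to 2D for visualization."""
    reducer = umap.UMAP(n_components=2, random_state=42)
    return reducer.fit_transform(X)
```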
We have presented two approaches to calculating genre distributions, with LDA performing slightly better than Zeta in the downstream evaluation. Based on these results, we were able to use the Gini index to calculate thematic complexity. Our assumption that high literature draws on a broader spectrum of subject areas than pulp fiction genres is confirmed.
Because Zeta and LDA perform so similarly, it seems promising to combine both methods, for example by initializing GuidedLDA with Zeta seed words.
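A sketch of this idea with the guidedlda package, where the top Zeta words of each genre seed one topic each (function names and parameter values are our assumptions):

```python
import guidedlda

def fit_guided(X, vocab, genre_seed_words, n_topics=150):
    """X: document-term count matrix; vocab: list of words;
    genre_seed_words: dict genre_index -> list of top Zeta words."""
    word2id = {w: i for i, w in enumerate(vocab)}
    # map each seed word to the topic reserved for its genre
    seed_topics = {word2id[w]: t for t, words in genre_seed_words.items()
                   for w in words if w in word2id}
    model = guidedlda.GuidedLDA(n_topics=n_topics, n_iter=1000, random_state=7)
    model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
    return model
```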
An intermediate product of the Zeta calculation are word vectors whose dimensions represent the distinctiveness of a word for each genre. Considering these as naive word embeddings, it seems almost obvious to redefine the whole workflow by learning word embeddings jointly with the genre classification task.