Readers of literary texts have an impression of their complexity, and professional readers such as literary critics often use, if not the term, then the concept as a basis for their evaluation of literary texts. Recent attempts in literary studies to explore the notion had difficulties agreeing on a definition (Koschorke 2017), a fate we try to avoid by relying on a very general understanding of complexity: the number of elements and the number and quality of their relations. When we are talking about texts, these elements can be words, syllables, metaphors, intertextual relations, the syntax of sentences, themes and topics, etc. In this perspective, text complexity becomes a multi-dimensional phenomenon. We should emphasize that we are talking about a set of textual features and not about the difficulty a reader has in processing the text, which is the domain of cognitive reader studies. Our attempt in this paper to describe a useful approach to thematic complexity in fiction is part of our ongoing research on the complexity of fiction. In Jannidis et al. (2019) we looked at measures of vocabulary richness and syntactic complexity and applied them to the same collection of popular genre novels and the same collection of literature by renowned German writers that we analyze in this paper. Surprisingly, with the exception of sentence length, no single measure allows us to distinguish consistently between these two collections. This study now turns to thematic complexity. Obviously there is no limit to the themes and topics a novel can deal with, but an unbounded quantity is difficult to measure. We therefore use the mixture of genres in documents as a proxy for thematic complexity, and we measure this mixture using topic modeling and Zeta to describe the genres.
The analysis is based on the digital texts delivered to the German National Library (DNB). The DNB collects, archives, catalogues and makes available to the public the media works published in Germany since 1913, as well as German-language media works published abroad. Since 2006, collecting media works published online has also been one of the DNB's tasks.
The DNB's holdings currently comprise more than 5 million digital objects, including some 900,000 e-books, around 1.5 million e-journal issues and about 2 million e-paper editions. In addition to the extensive physical inventory, DNB users have a growing pool of born-digital objects at their disposal.
The cooperation with the DH communities is a strategically important extension of the DNB's range of services. One aspect is the support of selected cooperation partners by providing even large corpora, primarily of born-digital objects such as e-books, for automatic analyses carried out on site in the DNB labs.
Our corpus comprises 9,000 novels from 13 pulp fiction genres as well as highbrow novels by prize-winning authors (see Fig. 1).
We aim to compute thematic complexity from the mixture of genres in documents. Since genre labels for the documents are provided, we use them to identify word fields covering stereotypical themes, topoi or motifs in genres. For this task, we use Zeta (Schöch et al., 2018) and LDA (Blei et al., 2003). In a second step, the words are used to measure a distribution of genres in each novel.
Zeta was originally used to determine which words are well suited to distinguishing two groups of texts, for example texts by two authors. We, however, are only interested in which words are characteristic of a single group. This is why we compare a homogeneous group of texts from one genre with a heterogeneous group representing the entire corpus.
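For reference, the classic variant of Zeta (sd0 in Schöch et al., 2018) compares the proportions of text segments in which a word occurs; a sketch of the formula, with A the genre group and B the heterogeneous comparison group:

```latex
% Classic Zeta (sd0): difference of segment proportions.
% df(x, A): number of segments of corpus A containing word x;
% n_A, n_B: total number of segments in A and B.
\mathrm{Zeta}(x) = \frac{df(x, A)}{n_A} - \frac{df(x, B)}{n_B}
```

Scores range from -1 (the word occurs only in segments of B) to 1 (only in segments of A).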
The Zeta scores of a word x for each genre are multiplied by the frequency with which the word appears within a document. To represent a document by a single vector, the average scores over all words are computed.
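A minimal sketch of this weighting step in Python, assuming precomputed per-genre Zeta scores (the variable and function names are ours):

```python
import numpy as np
from collections import Counter

def document_vector(tokens, zeta_scores, genres):
    """Represent a document as one averaged Zeta score per genre.

    tokens: list of word tokens in the document
    zeta_scores: dict mapping genre -> {word: Zeta score}
    genres: ordered list of genre labels
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    vec = np.zeros(len(genres))
    for g_idx, genre in enumerate(genres):
        scores = zeta_scores[genre]
        # multiply each word's Zeta score by its frequency,
        # then average over all words in the document
        weighted = sum(scores.get(w, 0.0) * f for w, f in counts.items())
        vec[g_idx] = weighted / total
    return vec
```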
We use the variants of Zeta listed by Schöch et al. (2018); see Table 1.
We use the LDA implementation Mallet (McCallum, 2002) to compute topic models with 50, 100, 150, 200 and 250 topics, running 1,000 iterations over our corpus and using the same stopwords as in the Zeta pipeline. For each topic, a genre distribution is calculated from the genre labels, and this distribution is used to infer the distribution of genres in documents via their share of topics (see Fig. 2). Because every topic contains all words and every document contains all topics to some extent, we only take topics into account whose proportion in a document reaches a threshold of 5%.
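A sketch of this inference step, assuming the document-topic proportions have already been read from Mallet's output (the function and variable names are ours):

```python
import numpy as np

def topic_genre_matrix(doc_topics, doc_genres, n_genres):
    """Per-topic genre distribution, aggregated from document labels.

    doc_topics: (n_docs, n_topics) topic proportions per document
    doc_genres: genre index of each training document
    """
    tg = np.zeros((doc_topics.shape[1], n_genres))
    for d, g in enumerate(doc_genres):
        tg[:, g] += doc_topics[d]
    # normalize each topic's accumulated genre weights to a distribution
    return tg / tg.sum(axis=1, keepdims=True)

def document_genres(doc_topics, topic_genres, threshold=0.05):
    """Genre distribution per document via its share of topics."""
    # ignore topics below the 5% threshold and renormalize
    # (assumes every document has at least one topic above the threshold)
    shares = np.where(doc_topics >= threshold, doc_topics, 0.0)
    shares /= shares.sum(axis=1, keepdims=True)
    # mix the topics' genre distributions according to their shares
    return shares @ topic_genres
```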
Both approaches represent documents as vectors with one dimension per genre. To measure thematic complexity, we use the Gini index (Giovanni Bellù and Liberati 2006). This measure of dispersion is commonly used in economics to quantify income inequality. Its values lie between 0, which indicates that all incomes are equal, or in our context that all genres contribute equally to a text, and 1 (approached for large numbers of elements), which means that only one person has any income, or in our context that only one genre contributes to the document. In this form, however, a higher value would mean lower complexity, so for better readability we use the inverse Gini index: a lower value indicates that mainly one genre's vocabulary is used, a higher value indicates the occurrence of words from different genres.
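A sketch of the complexity score, reading the inverse Gini index as 1 minus the Gini index (our interpretation):

```python
import numpy as np

def gini(x):
    """Gini index of a non-negative vector (e.g. genre proportions)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # standard formula based on the rank-weighted sum of sorted values
    cum = np.sum(np.arange(1, n + 1) * x)
    return 2.0 * cum / (n * x.sum()) - (n + 1.0) / n

def thematic_complexity(genre_vector):
    """Inverse Gini: higher values indicate a more even genre mixture."""
    return 1.0 - gini(genre_vector)
```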
There is no way for us to evaluate thematic complexity scores directly, because creating test data by human annotation would be quite time-intensive and of uncertain success. Instead, we use genre classification as a downstream task with our document representations (see 4.1, 4.2) as input features, in order to at least evaluate the genre distributions in documents on which thematic complexity is based.
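The paper does not fix a particular classifier; as a sketch, any standard classifier over the genre-distribution vectors can serve, for example logistic regression with cross-validation:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_genre_classification(X, y):
    """Downstream evaluation: predict the genre label from the
    14-dimensional genre-distribution vectors of the documents.

    X: (n_docs, 14) document vectors from Zeta or LDA
    y: gold genre labels
    """
    # the classifier and scoring choices here are our assumptions
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=10, scoring="f1_macro").mean()
```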
The evaluation results of the genre classification (see Table 2) show the best results for LDA with 150 topics. The following results, except for the most distinctive words in Table 3, are all based on this setup.
Figure 3 shows the transformation of the document vectors consisting of the 14 genre proportions into a two-dimensional representation. The calculation was done with Uniform Manifold Approximation and Projection (McInnes et al. 2018).
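A minimal sketch of this projection with the umap-learn package:

```python
import umap

def project_2d(X):
    """Project (n_docs, 14) genre vectors to 2D for visualization."""
    reducer = umap.UMAP(n_components=2, random_state=42)
    return reducer.fit_transform(X)
```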
We have presented two approaches to calculating genre distributions, with LDA performing slightly better than Zeta in the downstream evaluation. Based on these results, we were able to use the Gini index to calculate thematic complexity. Our assumption that high literature draws on a broader spectrum of subject areas than pulp fiction genres is confirmed.
Because Zeta and LDA perform so similarly, it seems promising to combine both methods, for example by initializing GuidedLDA with Zeta seed words.
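A sketch of this idea with the guidedlda package, where the top Zeta words of each genre seed one topic each (function names and parameter values are our assumptions):

```python
import guidedlda

def fit_guided(X, vocab, genre_seed_words, n_topics=150):
    """X: document-term count matrix; vocab: list of words;
    genre_seed_words: dict genre_index -> list of top Zeta words."""
    word2id = {w: i for i, w in enumerate(vocab)}
    # map each seed word to the topic reserved for its genre
    seed_topics = {word2id[w]: t for t, words in genre_seed_words.items()
                   for w in words if w in word2id}
    model = guidedlda.GuidedLDA(n_topics=n_topics, n_iter=1000, random_state=7)
    model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
    return model
```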
An intermediate product of the Zeta calculation are word vectors whose dimensions represent the distinctiveness of a word for each genre. Considering these as naive word embeddings, it seems almost obvious to redefine the whole workflow by learning word embeddings jointly with the genre classification task.