Neutralising the Authorial Signal in Delta by Penalization: Stylometric Clustering of Genre in Spanish Novels

José Calvo Tello
jose.calvo@uni-wuerzburg.de
University of Würzburg

Daniel Schlör
daniel.schloer@informatik.uni-wuerzburg.de
University of Würzburg

Ulrike Henny
ulrike.henny@uni-wuerzburg.de
University of Würzburg

Christof Schöch
christof.schoech@uni-wuerzburg.de
University of Würzburg

Summary

We propose a way to use the stylometric distance measure Delta to analyse the subgenre of texts written by different authors. To do so, we neutralize the author signal by penalizing the distances between texts by the same writer, so that each text has its shortest distances to works by other authors. We test this method with several subcorpora of Spanish prose and a corpus of French theatre.

Stylometry and Delta beyond Authorship

Since John Burrows proposed it in 2002, Delta has been one of the most widely used and researched methods in stylometry and authorship attribution. Burrows described it as an “expression of difference, pure difference” (2002: 269); it is based on basic statistical concepts: most frequent words, z-scores and the Manhattan distance between each pair of texts.^2 Burrows closes his paper with an unanswered question about why Delta works so well. Other researchers such as Hoover (2004b: 454), Argamon (2008), Plasek (2014) or Evert et al. (2015: 79) have confirmed that we are still far from being able to answer this question. This lack of understanding has not stopped the stylometric community from trying to improve Delta (Hoover 2004a; Argamon 2008; Eder 2013). Smith and Aldridge (2011) proposed Cosine Delta, which gives the best results in different languages (Jannidis et al. 2015).

Since Delta is sensitive to aspects or signals like genre or period (Burrows 2002), the corpora for authorship attribution tend to be homogeneous in those aspects.
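The ingredients of Delta named above (most frequent words, z-scores, Manhattan distance) and the cosine variant can be sketched as follows. This is a minimal illustration, not the Stylo implementation used in our experiments: the function names, the toy frequency matrix and the choice of the population standard deviation are assumptions made for the example.

```python
import numpy as np

def zscores(freqs):
    """Standardise each word column across texts (population std: an assumption)."""
    freqs = np.asarray(freqs, dtype=float)
    std = freqs.std(axis=0)
    std[std == 0] = 1.0  # constant columns would otherwise divide by zero
    return (freqs - freqs.mean(axis=0)) / std

def burrows_delta(freqs):
    """Mean absolute z-score difference (Manhattan distance / number of words)."""
    z = zscores(freqs)
    diff = np.abs(z[:, None, :] - z[None, :, :])  # pairwise |z_i - z_j| per word
    return diff.mean(axis=2)

def cosine_delta(freqs):
    """Cosine distance between the z-score vectors of each pair of texts."""
    z = zscores(freqs)
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against all-zero z-score rows
    unit = z / norms
    return 1.0 - unit @ unit.T

# Hypothetical relative frequencies: 4 texts (rows) x 3 most frequent words (columns).
toy_freqs = np.array([
    [0.05, 0.03, 0.01],
    [0.04, 0.03, 0.02],
    [0.02, 0.05, 0.01],
    [0.03, 0.04, 0.03],
])
burrows = burrows_delta(toy_freqs)
cosine = cosine_delta(toy_freqs)
```

Both functions return a square, symmetric matrix with one row and one column per text, which is the kind of distance matrix that the clustering step takes as input.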
Research has been conducted to separate signals (Schöch 2013 and 2014) or to select the words that contribute to them using Recursive Feature Elimination (Büttner and Proisl 2016). Jannidis and Lauer (2014) and Hoover (2014) show how Delta can be used to distinguish genre and periods within the works of a single author. Other researchers have used other methods such as classification (Hettinger et al. 2016; Underwood 2014) or logistic regression (Jockers 2013; Riddell and Schöch 2014) to similar ends.

Neutralizing Author Signal in Delta

Our proposal is to neutralize the author signal directly on the Delta matrix. We use a testing corpus of texts from three Spanish authors and three subgenres. Detailed information about the corpora, files, parameters and scripts is in our GitHub repository.^3 We applied Cosine Delta (5000 MFW) with Stylo (Eder, Rybicki and Kestemont 2016) and visualized the resulting distance matrix with Python:

[037-1]
Figure 1: Dendrogram from Cosine Delta (plot title: Hierarchical Clustering Dendrogram (Ward); x-axis: sample index)

As expected, the texts are clustered by author, with sub-clusters of subgenres. The underlying Delta matrix contains the distances between all texts:

[037-2]
Figure 2: Cosine Delta Matrix

Delta values tend to be lower for documents by the same author (below 1.0) than for documents by different authors (above 1.0). But what about the closest texts written by a different author? For the historical novel in column E, they are in rows 14 and 15 and are also historical novels. This pattern is found for the majority of the texts. How could we cluster the texts so that the closest texts from other authors are preferred? And if we are able to neutralize the author signal, will we see noise or subgenre clusters?

Our proposal is to penalize the distances between the texts of the same author (cf.
Lu and Leen 2007 for penalization in image clustering), making them closer to the average distance between texts of different authors, then to cluster the neutralized distance matrix and measure the cluster homogeneity by author and subgenre.

We define the set of all documents by an author $a$ as $A_a$, the collection containing all documents by all authors as $C$, and the total number of documents in the collection as $c$:

$$A_a := \{ d_i \mid d_i \text{ is a document by author } a \}, \qquad C := \{A_1, \dots, A_n\}, \qquad c := \left| \bigcup C \right|$$

Note that each document is in exactly one author-document set $A_i$.

First, we calculate the average distance between texts of all pairwise different authors (in fig. 2, all the distances in black). We call this value the mean of different authors, $M(C)$; for this collection its value is 1.16.

$$M(C) = \frac{\sum_{A_a, A_b \in C,\ a \neq b} \ \sum_{d_i \in A_a,\ d_j \in A_b} \Delta(d_i, d_j)}{\sum_{A_a \in C} |A_a| \cdot (c - |A_a|)}$$

[037-3] [037-4]

Second, for each author, we subtract his or her mean value $M(A_a)$ from the mean of different authors, $M(C) - M(A_a)$, which gives the difference of the author. This value represents how far the texts of a specific author are from the mean of different authors:^4

┌────────┬──────┬──────────┐
│author  │mean  │difference│
├────────┼──────┼──────────┤
│Miro    │0.607 │0.552     │
├────────┼──────┼──────────┤
│Baroja  │0.669 │0.490     │
├────────┼──────┼──────────┤
│Valle   │0.752 │0.407     │
└────────┴──────┴──────────┘
Figure 3: Means and differences of author

Third, we add the difference of the author, $M(C) - M(A_a)$, to the Delta values between texts of the same author. This gives a Neutralized Delta function as follows:
ydi E A[a], dj E Ab A^U^’^ ^f°^ra/6 ^v 31 [A(