Long-range correlations in texts, which emerge even in dictionaries and make it possible to differentiate genres (Montemurro and Pury, 2002), prove that structures larger than those necessitated by syntax exist. They might reflect the organisation of literary works and serve as one of the authorial fingerprints.
Stylometry, however, has not exploited the information carried by memory extending beyond a single clause, apart from the use of n-grams (e.g., Eder, 2011). Existing studies investigate sequences of (un)stressed syllables (Pawłowski, 1998; 1999), sentence lengths (Drożdż et al., 2016), and the transfer of long-range correlations between letter and word sequences (Altmann et al., 2012).
To quantify correlations in a text, successive symbols can be treated as a time series with symbolic values (Stanley, 1992) or with numeric values given by positions on a word frequency list (Montemurro and Pury, 2002; Ausloos, 2012); see Fig. 1. Below, I use information extracted from such time series as features in machine learning (ML) methods to increase the accuracy of authorship attribution (AA) in a benchmark literary corpus.
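To make this representation concrete, the following minimal Python sketch turns a text into such a rank time series; the function name and the alphabetical tie-breaking rule are my own illustrative choices.

from collections import Counter

def text_to_rank_series(words):
    """Map each token to its rank (1 = most frequent) on the text's own frequency list."""
    freq = Counter(words)
    # Rank words by descending frequency; ties are broken alphabetically.
    ranking = {w: r for r, (w, _) in
               enumerate(sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])), start=1)}
    return [ranking[w] for w in words]

words = "the cat sat on the mat and the dog sat too".split()
print(text_to_rank_series(words))  # [1, 4, 2, 7, 1, 6, 3, 1, 5, 2, 8]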
For a series defined as the ranks of words at consecutive positions in a text, the following measures are used (formatted as Quantity: ML features):
Power spectrum: psLen, psExp
The power spectrum S(f) of a series at a frequency f can be interpreted as the strength of the correlation of the series with itself at word-to-word distances 1/f. As Fig. 2 illustrates, it is well described by two parameters: the length psLen of the high-correlation plateau and the slope psExp of its decay.
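A possible implementation of these two features is sketched below; the knee-detection heuristic is a simplified stand-in for the fitting procedure behind Fig. 2, not the exact method used, and all names besides psLen and psExp are mine.

import numpy as np

def power_spectrum_features(series):
    x = np.asarray(series, dtype=float)
    x -= x.mean()
    S = np.abs(np.fft.rfft(x)) ** 2            # power spectrum S(f)
    f = np.fft.rfftfreq(len(x))                # f in cycles per word
    f, S = f[1:], S[1:]                        # drop the zero frequency
    plateau = S[:10].mean()                    # low-frequency plateau level
    # Crude knee estimate: first frequency where the coarsely smoothed
    # spectrum falls below half the plateau level.
    smooth = np.convolve(S, np.ones(5) / 5, mode="same")
    below = np.where(smooth < plateau / 2)[0]
    knee = below[0] if below.size else len(f) // 2
    psLen = 1.0 / f[knee]                      # plateau length in words
    psExp = np.polyfit(np.log(f[knee:]), np.log(S[knee:]), 1)[0]  # decay slope
    return psLen, psExp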
Predictability: pred
As the name suggests, it measures how well the next step in a series can be predicted from the previous steps (for the definition, see Stone, 1996).
Fano factor exponent: fanoF
The Fano factor measures signal autocorrelation, especially in fractal processes (Thurner et al., 1997), as one takes increasingly larger chunks of text, similarly to the slope of the power spectrum or to detrended fluctuation analysis (Grabska-Gradzińska et al., 2013).
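One plausible implementation is sketched below; note that treating occurrences of a single chosen word as the events is my assumption, since the event definition is not spelled out here. F(T) = Var(N_T) / Mean(N_T) for counts N_T in non-overlapping windows of T words, and fanoF is the slope of log F(T) versus log T.

import numpy as np

def fano_exponent(words, target, window_sizes=(10, 20, 50, 100, 200, 500)):
    events = np.array([w == target for w in words], dtype=float)
    log_T, log_F = [], []
    for T in window_sizes:
        n_win = len(events) // T
        if n_win < 2:
            continue
        counts = events[:n_win * T].reshape(n_win, T).sum(axis=1)
        if counts.mean() > 0 and counts.var() > 0:
            log_T.append(np.log(T))
            log_F.append(np.log(counts.var() / counts.mean()))
    return np.polyfit(log_T, log_F, 1)[0]      # fanoF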
Entropy rate of word variation: entExp, entConst
The entropy is maximal for equiprobable word occurrences and minimal when a single word is always used. As one reads a text, new words appear and the entropy grows, eventually saturating. entExp and entConst are the characteristic time and the multiplicative constant of this growth.
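Under the assumption that the growth follows a saturating exponential S(n) = entConst (1 - exp(-n/entExp)), which is one natural reading of "characteristic time" and "multiplicative constant" but is not stated here, the two features could be fitted as follows; the step size is an illustrative choice.

import numpy as np
from collections import Counter
from scipy.optimize import curve_fit

def entropy_growth_features(words, step=100):
    ns, entropies = [], []
    counts = Counter()
    for i, w in enumerate(words, start=1):
        counts[w] += 1
        if i % step == 0:
            # Shannon entropy of the word distribution in the first i words.
            p = np.array(list(counts.values()), dtype=float) / i
            ns.append(i)
            entropies.append(-np.sum(p * np.log2(p)))
    model = lambda n, const, tau: const * (1.0 - np.exp(-n / tau))
    (entConst, entExp), _ = curve_fit(model, ns, entropies,
                                      p0=(max(entropies), len(words) / 2))
    return entExp, entConst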
Static entropy: entLocM, entLocSD
For a window of constant length moving across the whole text, the entropy fluctuates. The parameters entLocM and entLocSD are its mean and standard deviation.
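These two features are straightforward to compute; a minimal sketch follows, with illustrative window and step sizes.

import numpy as np
from collections import Counter

def local_entropy_features(words, window=1000, step=100):
    entropies = []
    for start in range(0, len(words) - window + 1, step):
        # Shannon entropy of the word distribution inside the window.
        counts = Counter(words[start:start + window])
        p = np.array(list(counts.values()), dtype=float) / window
        entropies.append(-np.sum(p * np.log2(p)))
    return np.mean(entropies), np.std(entropies)   # entLocM, entLocSD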
AA was performed with the R package stylo (Eder et al., 2013) with the following settings: delta distance (Burrows, 2002), 1000-fold cross-validation, and one book per author in the training set. (None of the other ML methods (Stamatatos, 2009; Jockers and Witten, 2010) implemented in stylo performed significantly better than Burrows's delta.)
Since about 90 most frequent words (MFW) are needed for 100% accuracy on this corpus, only the first ten were used as features, which left room for improvement. All eight measures were precomputed and appended to this feature list.
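For illustration only, the sketch below shows how the appended measures can enter a delta-style comparison; the actual attribution was done in stylo, so this Python version is a structural illustration, and both input arrays are assumed to be precomputed.

import numpy as np

def delta_matrix(mfw_freqs, complexity):
    """mfw_freqs: (n_texts, 10) relative frequencies of the 10 MFW;
    complexity: (n_texts, 8) precomputed measures (psLen, ..., entLocSD)."""
    X = np.hstack([mfw_freqs, complexity])
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # z-score each feature
    # Burrows's delta: mean absolute difference of z-scores between texts.
    return np.abs(Z[:, None, :] - Z[None, :, :]).mean(axis=2)

A test text is then attributed to the author of the training text at the smallest delta distance.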
A corpus (Rybicki, 2015) comprising 27 classic British 19th-century novels by 11 authors was used (see Fig. 3, where each leaf is a shorthand for a novel's Author_Title). This corpus was chosen because many AA algorithms have been tested on it and perform very well, not least thanks to its size.
AA algorithms use at most 6-grams (Eder, 2011), whereas the correlations may reach hundreds of words, as demonstrated in Fig. 2. The results in Tables 1-2 show that the measures from Sec. 2.1 can aid ML. As a proof of concept, Fig. 3 shows a cluster analysis based exclusively on these complexity measures; although imperfect, it strongly indicates that the temporal characteristics contain traces of authorial style.
Surprisingly, psLen is not correlated with paragraph lengths (cf. Kosmidis et al., 2006). Its smallest values (280-300) come, intriguingly, from Austen and from Anne and Emily Brontë, while the largest (370-390) come from Dickens, Thackeray and Trollope.
Note that correlated features (see Tab. 3 for a summary) worsen performance and should be eliminated; the remaining parameters are expected to contain non-overlapping information. Further, a principal component analysis (PCA) showed that psLen and entLocM contain the most distinctive information. Tables 1-2 confirm that these two parameters aid ML the most.
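A sketch of such a PCA check is given below; the library choice (scikit-learn) and function names are mine. The absolute loadings on the first principal component indicate which features carry the most distinctive variance.

import numpy as np
from sklearn.decomposition import PCA

def dominant_features(features, names):
    """features: (n_texts, 8) matrix of measures; names: list of 8 labels."""
    Z = (features - features.mean(axis=0)) / features.std(axis=0)
    pca = PCA().fit(Z)
    loadings = np.abs(pca.components_[0])      # first-component loadings
    return sorted(zip(names, loadings), key=lambda t: -t[1])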
This preliminary study shows that measures reflecting long-range word-to-word correlations carry authorial information and enhance stylometric ML methods. Features more complex than words and n-grams are thus called for.
I thank Maciej Eder for insightful discussions. The research was funded by Grant No. DEC-2013/09/N/ST6/01419 of the National Science Centre of Poland.