Delta (Burrows, 2002) is a measure that has proven to be a reliable method for resolving authorship attribution problems in several languages, including English and German. However, its accuracy on Chinese texts has not yet been reported. I therefore set up an experiment to test it. The tests cover both modern and classical Chinese because the two differ grammatically and lexically.
First I determined whether Delta works on modern Chinese. After that I ran tests on classical Chinese. Finally, I tested Dream of the Red Chamber (DRC, 红楼梦). The number of authors of DRC is a classic question in Chinese literary studies.
I focussed on Classic Delta in my work. Other variations of Delta, such as Eder’s Delta and Argamon’s Linear Delta, were not tested.
The tool I used in the experiments is “stylo”, an R package for stylometry introduced in 2013 (Eder, 2013). Using “stylo” I carried out cluster analysis. All texts by one author should fall into one group; misplaced texts are counted as mistakes. The more mistakes Delta makes, the less appropriate it is for Chinese.
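As a rough illustration of what Classic Delta measures, the sketch below computes it in plain Python: relative frequencies of the most frequent features are z-scored across the corpus, and the Delta between two texts is the mean absolute difference of their z-scores. This is not the code of “stylo”. The function names, the use of the sample standard deviation, and the way the most frequent features are chosen (raw counts over the whole corpus) are assumptions made for the example.

```python
from collections import Counter
from statistics import mean, stdev


def relative_freqs(tokens):
    """Map each feature to its relative frequency in one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {feat: n / total for feat, n in counts.items()}


def classic_delta(corpus, n_features):
    """Return {(name_a, name_b): delta} for every pair of texts.

    corpus: dict mapping a text name to its list of features
            (words or character n-grams); needs at least two texts.
    n_features: number of most frequent features to use.
    """
    # Assumption: the most frequent features are picked by raw count
    # over the whole corpus.
    overall = Counter()
    for tokens in corpus.values():
        overall.update(tokens)
    features = [feat for feat, _ in overall.most_common(n_features)]

    freqs = {name: relative_freqs(tokens) for name, tokens in corpus.items()}

    # z-score every feature across all texts.
    # Assumption: sample standard deviation (n - 1 in the denominator).
    z = {name: [] for name in corpus}
    for feat in features:
        values = [freqs[name].get(feat, 0.0) for name in corpus]
        mu, sigma = mean(values), stdev(values)
        if sigma == 0:
            continue  # a constant feature cannot separate texts
        for name, value in zip(corpus, values):
            z[name].append((value - mu) / sigma)

    # Delta = mean absolute difference of z-scores for each pair of texts.
    names = list(corpus)
    deltas = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            deltas[(a, b)] = mean(abs(x - y) for x, y in zip(z[a], z[b]))
    return deltas
```

The resulting pairwise distances are the kind of input a cluster analysis works on; a smaller Delta means two texts are stylistically closer.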
Processing Chinese differs from processing languages such as English. The greatest challenge is that word boundaries cannot be recognized directly, because Chinese is written without spaces between words. There are two ways to address this problem: (i) use a segmenter to split a text into words and take words as the textual feature, or (ii) take character n-grams as the feature. Both solutions were tested here and the results are presented as a comparison.
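The two feature types can be sketched as follows. This is a minimal illustration, not the preprocessing used in the experiments: the helper names are hypothetical, the segmenter output is assumed to be plain text with words separated by whitespace, and punctuation is not treated specially.

```python
def char_ngrams(text, n):
    """Overlapping character n-grams of an unsegmented text (whitespace ignored)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def word_ngrams(segmented_text, n):
    """Word n-grams from segmenter output, assumed to be whitespace-delimited."""
    words = segmented_text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(char_ngrams("红楼梦", 2))           # ['红楼', '楼梦']
print(word_ngrams("我 喜欢 红楼梦", 1))   # ['我', '喜欢', '红楼梦']
```

Character n-grams need no segmenter at all, which is why they remain available when no suitable segmentation standard exists for a text.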
For my first experiment I gathered 45 modern Chinese texts by 6 authors. I used the Stanford segmenter to split the texts and selected both words and characters as features. The results showed that Delta is reliable (Fig. 1). With the 100 most frequent word bigrams, Delta correctly identifies 38 of 45 texts. The best results, 43 of 45 texts, occur with the 200 to 700 most frequent character bigrams or the most frequent word unigrams.
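Counts such as “38 of 45” come from reading the cluster analysis. A simpler proxy, which is a swapped-in technique and not the procedure used in the experiments, is a nearest-neighbour check: a text counts as a mistake if the text closest to it by Delta is by a different author. The function name and the input format (a list of author labels plus a square distance matrix) are assumptions made for this sketch.

```python
def count_nearest_neighbour_mistakes(labels, dist):
    """Count texts whose nearest other text has a different author label.

    labels: list of author labels, one per text.
    dist:   square matrix (list of lists) of pairwise distances,
            in the same order as labels.
    """
    mistakes = 0
    for i, row in enumerate(dist):
        # Index of the closest text other than the text itself.
        nearest = min((k for k in range(len(row)) if k != i), key=lambda k: row[k])
        if labels[nearest] != labels[i]:
            mistakes += 1
    return mistakes
```

This check can disagree with a dendrogram reading, so its counts are not directly comparable with the figures reported here.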
After the tests on modern Chinese, I proceeded with my second experiment on classical Chinese. I randomly took 4 chapters from each of 10 novels from the Ming and Qing Dynasties (16th to 19th century) and built a corpus of 40 documents. One problem was that the Stanford segmenter could no longer be used, because its segmentation standards are not suitable for classical Chinese. Hence the only option was to take characters as features. The results showed that Delta also works (Fig. 2). While many mistakes occurred with character trigrams, character bigrams achieved a high level of accuracy. With the 600 most frequent characters, 39 of 40 documents were correctly identified.
The first two experiments confirmed Delta as a valid measure for both modern and classical Chinese. In the third experiment Delta was applied to Dream of the Red Chamber (DRC). According to Tu (2013), the edition of DRC available at http://cls.hs.yzu.edu.tw/hlm/read/TEXT/TEXT.ASP is “the closest to the earliest editions”, so I took it for my study.
My experiment suggests the same conclusion reached by other scholars: DRC was written by two different authors (Fig. 3). The texts were divided into two groups: red texts represent the first 80 chapters and green texts the remaining 40 chapters. Delta also suggests that Chapter 67 was written by the second author.