Converted from an OASIS Open Document
In fictional prose narrative such as novels and short stories, various forms of speech, thought, and writing representation are ubiquitous and have been studied in great detail in linguistics and literary studies. However, beyond quotation marks, what are linguistic markers of direct speech? And just how ubiquitous is direct speech really? Is there systematic variation in the amount of direct speech over time or across genres? Especially for the field of French literary history, where typography is not a reliable guide, we really don't know.
This is regrettable, because being able to quickly and automatically detect direct speech in large collections of literary narrative texts is highly desireable for many areas in literary studies. In the history of literary genres, this allows to observe distributions and evolutions of a fundamental, formal aspect of the novel on a large scale. In narratology, differentiating narrator from character speech is a precondition for more detailed analyses of narrator speech, e.g. with regard to text type (descriptive, narrative, argumentative text). And in authorship attribution, it hereby becomes possible to discard character speech from a set of novels and perform authorship attribution on the narrator speech only, something which may improve attribution.
Against this background, the work presented here addresses both the question of how to identify direct speech in French prose fiction and that of how prevalent direct speech is in different subgenres of the nineteenth-century French novel.
Our first aim has been to use machine learning to automatically identify direct character speech in a small collection of French-language fictional prose. This is less trivial than it seems to be since in the French typographical tradition, direct speech is usually not marked with opening and closing quotation marks (figure 1). Rather, a long hyphen usually indicates the beginning of direct speech, whereas the end is left unmarked. In figure 1, the first highlighted direct speech continues after the insertion revealing who has just spoken (“lui dit-il, tout bas,”;
he quietly said to him). In the second example, the direct speech ends after the speaker has been indicated (“dit une voix à la portière”;
said a voice at the door). Our hypothesis is that there are enough linguistic markers of direct speech to make it possible to identify it automatically and reliably (for an overview of such markers, see Durrer, 1994).
Our second aim has been to use the best-performing algorithm to identify direct speech in a larger collection of French nineteenth-century novels and to study its distribution. Here, we hope to detect significant differences in the proportion of direct speech found in different novelistic subgenres. Research about other literary traditions supports this hypothesis (e.g. Allison et al., 2011).
Speech, thought and writing representation are common topics in narratology and stylistics (Genette, 2008, Leech/Short, 2007). Semino and Short’s 2004 quantitative study finds that direct representation is clearly the most frequent type in their English fiction sub-corpus. Brunner 2015 confirms this trend for her corpus of German short narratives. Here, the percentage of sentences containing direct speech is about 35% and varies widely over different texts (2%-72%).
Frequently, speech representation recognition is an auxiliary step to other tasks, e.g. knowledge extraction or speaker recognition (Krestel at al., 2008, Elson and McKeown, 2010, Iosif and Mishra, 2014, Sarmento/Nunes, 2009). Weiser and Watrin 2012 used a rule-based approach to extract unmarked quotations in French newspaper texts with success rates of 0.745-0.789. Brunner 2015 focuses on speech, thought and writing representation in German short literary narratives. Using machine learning with random forests, she reports an F1 score of 0.87 for direct speech in a sentence-based cross-validation.
Our text collection contains 127 French novels published between 1840 and 1889. Three generic subsets can be distinguished, each of which is represented by approximately 40 texts: general novelistic fiction (so-called ‘littérature blanche’) is contrasted with specific subgenres, crime fiction (‘policier’) and fantastic novels (see figure 2). The narrative perspective is largely heterodiegetic.
To obtain a gold standard, 40 chapters from 20 different novels were randomly chosen from the collection and annotated manually. 5734 sentences were marked as either containing direct speech or not containing any direct speech; the former also include mixed sentences.
To prepare feature generation, preprocessing was performed, the pipeline consisting of the Stanford CoreNLP-Tokenizer and Sentence-Splitter, as well as the TreeTagger for POS-Tagging and Lemmatization.
We modeled 81 features which we believed to be useful cues for the classification task (see the annex for a ranked list). Features are generated on a sentence-based level and can be divided into different categories:
For the binary classification task (sentences containing vs. not containing direct speech), we used an annotation and classification framework developed by Markus Krug (Würzburg) wrapping LibSVM Support-Vector-Machine (Chang and Lin, 2011), Maximum Entropy (Nigam et al., 1999) and Naïve Bayes (John and Langley, 1995) and implemented in MALLET (McCallum 2002). Random Forest (Breiman, 2001) and JRip (Cohen, 1995) were applied using Weka. All experiments were validated using 10-fold cross-validation unless otherwise stated.
The machine learning algorithms’ incorrect assignments on the gold standard (false positives and false negatives) were manually analyzed in order to detect the errors' underlying causes.
Using the best-performing model, all sentences in the text collection were tagged for containing direct speech or not. The distribution of ratios of direct speech / non-direct speech was calculated for the three subgenres and five decades covered by the collection. Performance on these unseen texts was checked manually on a random sample. (We sampled 2300 sentences, i.e. 100 random sentences each from a sample of 23 novels stratified by ratio of direct speech.)
Table 1 depicts the performance for different conditions.
Our baseline is using the speech sign (i.e. the long hyphen) as the only feature, which yields an F1 score of 0.734. Random-Forest performs best, with an F1 score of 0.939, which we consider to be an impressive result. Even when excluding the speech sign from the features, we still reach an F1 score of 0.924, much better than the hyphen alone.
After inspecting the models, it becomes clear that only very few features carry strong cues for direct speech, namely (and unsurprisingly) the initial long hyphen. Most other features, taken separately, carry weak signals in either direction, but become relevant in combination.
Error analysis reveals that incorrect assignments (false positives and negatives) are frequently due to imperfect sentence segmentation. Several features which have been previously used to define and recognize direct speech (question / exclamation marks, interjections, verbal tenses) also cause incorrect assignments, especially in the context of homodiegetic narration, where the narrator is somewhat involved in the plot so that his narrator speech is similar to direct speech. Finally, letters are sometimes mistaken for direct speech, which makes sense given that in most of them, one person addresses one or several other people.
We applied the best-performing algorithm (Random Forest) to the entire text collection. Evaluation shows a certain drop in performance, with a weighted average success rate of 0.844, indicating less-than-perfect generalization. We noted a welcome absence of any strong bias for either direct or non-direct speech. Our results suggest that the average proportion of direct to non-direct speech across the collection is 61% sentences with direct speech (and 39% without direct speech).
While variance is considerable (see figure 3), the proportion of direct speech in French nineteenth-century novels is overall much higher than expected (and higher, for example, than the 35% reported by Brunner 2015 for German novellas).
Figure 4 shows that both fantastic novels and crime fiction have a significantly higher median for proportion of direct-speech than ‘littérature blanche’, but do not differ significantly from each other (for significance tests, we used the non-parametric Kruskal-Wallis test at a significance level of 1%).
Figure 5 shows that only the ratios for the 1850s and the 1880s have a significantly differing level. However, because the decades do not have perfectly balanced subgenre proportions, this is probably due to a subgenre imbalance rather than an effect of the time period.
Using a wide range of linguistic markers allows the reliable identification of direct speech, even in the absence of clear typographic markers. Performance is excellent to good (F1-score of 0.94 on the gold standard, weighted average success rate of 0.844 on unseen texts). Using our method reveals that nineteenth-century French novels contain a large proportion of sentences with direct speech (61% on average). Also, there are previously unseen differences in direct speech proportion for subgenre, but not for time period.
For future work, we plan to use several strategies to improve performance. One is to add more sequential information to our set of features. Examples include the position, inside a sentence, of certain lexical or typographical features as well as linguistic cues preceding and following direct speech. Also, we plan to expand our corpus to make it more balanced in terms of genres and decades. This will allow us to discover genre-related patterns of interest to literary historians in a more reliable manner and assess their significance with more confidence.
Supplementary material can be found at: https://github.com/cligs/projects/tree/master/2016/dh.
List of features used, sorted by descending rank by a one-rule classifier.
average merit average rank attribute
74.028 +- 0.168 1 +- 0 79 SPEECHSIGN
71.743 +- 0.16 2 +- 0 57 VER:impf
65.847 +- 0.234 3 +- 0 54 VER:pres
63.893 +- 0.155 4 +- 0 55 VER:simp
63.248 +- 0.136 5 +- 0 6 PUNCMARKDOT
59.48 +- 0.12 6 +- 0 29 MATCHINGPPER_SON
58.835 +- 0.094 7.7 +- 0.64 30 MATCHINGPPER_SES
58.695 +- 0.208 8.1 +- 0.94 24 MATCHINGPPER_IL
58.713 +- 0.104 8.4 +- 0.92 35 VERB_MOTION
58.364 +- 0.083 10.6 +- 0.49 28 MATCHINGPPER_SA
58.344 +- 0.417 10.8 +- 1.78 7 SENTENCELENGTH
58.172 +- 0.078 11.7 +- 0.46 61 VER:subi
57.492 +- 0.091 14 +- 1.41 25 MATCHINGPPER_ELLE
57.422 +- 0.103 14.5 +- 1.36 44 VERB_PERCEPTION
57.387 +- 0.248 14.9 +- 1.51 50 INNERSUBCLAUSE
57.356 +- 0.4 15.8 +- 2.09 48 UNKNOWNLEMMA
57.213 +- 0.07 16.5 +- 1.02 31 MATCHINGPPER_LEUR
57.143 +- 0.162 17.3 +- 1.1 60 VER:ppre
56.672 +- 0.042 20.2 +- 0.98 36 VERB_BODY
56.672 +- 0.115 21 +- 1.84 52 VER:cond
56.62 +- 0.136 21.7 +- 2.1 40 VERB_EMOTION
56.567 +- 0.072 22.3 +- 1.19 26 MATCHINGPPER_ILS
56.497 +- 0.033 23.9 +- 1.3 41 VERB_COGNITION
56.428 +- 0.044 25 +- 1 46 VERB_CONSUMPTION
56.201 +- 0.005 34.5 +- 4.06 20 MATCHINGPPER_VOTRE
56.339 +- 0.176 35.4 +-18.69 32 COMMAS
56.201 +- 0.005 35.8 +- 4.19 21 MATCHINGPPER_VOS
56.201 +- 0.005 35.8 +- 6.4 22 MATCHINGPPER_TOI
56.201 +- 0.005 36.3 +- 4.2 17 MATCHINGPPER_TES
56.201 +- 0.005 37.6 +- 7.35 5 PUNCMARKCOLON
56.195 +- 0.018 37.7 +-13.33 18 MATCHINGPPER_NOTRE
56.201 +- 0.005 38.2 +- 3.16 23 MATCHINGPPER_MOI
56.424 +- 0.296 38.4 +-25.85 47 VERB_COMMUNICATION
56.201 +- 0.005 38.6 +- 6.45 4 PUNCMARKEXCL
56.201 +- 0.005 38.7 +- 3.44 16 MATCHINGPPER_TON
56.201 +- 0.005 39.4 +- 4.82 15 MATCHINGPPER_TA
56.201 +- 0.005 39.6 +- 6.45 3 PUNCMARKQUSTION
56.201 +- 0.005 40.2 +- 8.81 8 MATCHINGPPER_JE
56.201 +- 0.005 41.8 +-10.17 9 MATCHINGPPER_TU
56.201 +- 0.005 43.5 +- 9.19 10 MATCHINGPPER_NOUS
56.201 +- 0.005 43.5 +- 2.84 13 MATCHINGPPER_MON
56.201 +- 0.005 44.6 +- 4.43 12 MATCHINGPPER_MA
56.201 +- 0.005 44.7 +- 6.47 11 MATCHINGPPER_VOUS
56.261 +- 0.436 45.6 +-27.28 1 AmmountOfPPER
56.201 +- 0.005 45.8 +- 9.65 75 INTERJECTION_FI
56.201 +- 0.005 48 +-14.72 76 INTERJECTION_HEP
56.201 +- 0.005 50.2 +- 9.34 73 INTERJECTION_EH
56.201 +- 0.005 50.2 +- 6.27 74 INTERJECTION_EUH
56.201 +- 0.005 51.3 +- 3.66 81 INTERJECTION_MADAME
56.203 +- 0.08 51.3 +-23.56 37 VERB_COMPETITION
56.201 +- 0.005 52.1 +-15.75 58 VER:infi
56.201 +- 0.005 52.3 +-16.54 56 VER:futu
56.201 +- 0.005 53 +- 8.91 78 INTERJECTION_OUSTE
56.201 +- 0.005 54.7 +-17.43 34 VERB_CONTACT
56.162 +- 0.116 55.6 +-18.7 33 VERB_WEATHER
56.201 +- 0.005 56.9 +- 6.55 64 INTERJECTION_OH
56.201 +- 0.005 57.3 +- 4.5 63 INTERJECTION_AH
56.135 +- 0.087 58 +-24.31 19 MATCHINGPPER_NOS
56.193 +- 0.015 58.7 +-13.46 77 INTERJECTION_OUF
56.201 +- 0.005 59.3 +- 9.42 67 INTERJECTION_HÉLAS
56.143 +- 0.07 59.6 +-16.69 14 MATCHINGPPER_MES
56.201 +- 0.005 59.9 +- 4.5 42 VERB_STATIVE
56.201 +- 0.005 60.2 +- 2.64 62 VER:subp
56.201 +- 0.005 60.6 +- 8.39 71 INTERJECTION_CHUT
56.201 +- 0.005 62 +- 6.36 70 INTERJECTION_HEM
56.193 +- 0.015 62.5 +-10.87 66 INTERJECTION_HEIN
56.197 +- 0.011 62.6 +- 8 65 INTERJECTION_HÉ
56.201 +- 0.005 62.7 +- 5.87 51 DEIKTIKA
56.201 +- 0.005 63 +- 4.07 80 INTERJECTION_MONSIEUR
56.201 +- 0.005 63 +-11.79 53 VER:impe
56.005 +- 0.298 63.8 +-24.78 38 VERB_POSSESSION
56.201 +- 0.005 64 +- 5.67 39 VERB_SOCIAL
56.201 +- 0.005 64.3 +- 4.86 45 VERB_CHANGE
56.197 +- 0.013 64.7 +- 8.74 68 INTERJECTION_BAH
56.201 +- 0.005 64.7 +- 5.27 59 VER:pper
56.197 +- 0.015 65.2 +- 7.08 69 INTERJECTION_HOLÀ
56.139 +- 0.062 66.7 +-20.16 27 MATCHINGPPER_ELLES
56.183 +- 0.008 71.8 +- 5.23 72 INTERJECTION_BRAVO
56.005 +- 0.121 74.2 +- 9.41 43 VERB_CREATION
56.079 +- 0.038 77.4 +- 1.56 2 AmmountOfDET
55.99 +- 0.142 78.1 +- 3.73 49 POSNPP