Character-distinguishing features in fictional dialogue: the case of War and Peace

Daniil Skorinkin
skorinkin.danil@gmail.com
National Research University Higher School of Economics, Russia

Introduction

The study of character speech is a topic of fairly consistent interest among digital literary scholars. It is usually acknowledged that the voices of characters are essentially different from the narrator's own voice and should be treated separately. Some researchers remove fictional dialogue from the texts they study before applying any tools of computational investigation (Hoover, 2004). Considerable effort has recently been made to address the problem of identifying character speech in prose and attributing it to the correct speaker. One outcome of such research is the possibility of studying the voices of different characters on a relatively large scale and applying computational tools that measure their recurring stylistic parameters.

Method

The study of character speech has traditionally had strong ties to the fields of stylometry and authorship attribution, as their methods have proved quite useful for studying the idiolect of a fictional speaker. Suffice it to say that one of the seminal works in stylometry, Computation into Criticism (Burrows, 1987), focused on the study of character speech in Jane Austen's novels. The method developed by Burrows grew into what is currently known as Delta, a widely adopted standard for authorship attribution. Delta has been consistently and successfully applied to identifying the authors of unattributed texts in different languages and genres, but it has also seen considerable use as a purely stylometric tool for the study of texts whose authorship is undisputed. Among other things, this has included research into the specific idiolects of fictional characters (see, for example, Rybicki, 2006).
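To make the idea behind Delta more concrete, the following is a minimal sketch of the classic Burrows-style measure: the mean absolute difference between the z-scored relative frequencies of the most frequent words in two texts. It is an illustration only, not the stylo implementation. The function names, the toy corpus and the use of the population standard deviation are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean, pstdev


def most_frequent_words(corpus, n):
    """Return the n most frequent words counted over all texts in the corpus."""
    totals = Counter()
    for tokens in corpus.values():
        totals.update(tokens)
    return [word for word, _ in totals.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one tokenized text."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def burrows_delta(corpus, n_mfw):
    """Return {(text_a, text_b): delta} for every pair of texts.

    corpus maps a text name to its list of tokens (hypothetical input format).
    """
    vocabulary = most_frequent_words(corpus, n_mfw)
    names = sorted(corpus)
    freqs = {name: relative_frequencies(corpus[name], vocabulary) for name in names}

    # z-score every word across the corpus (population SD is an assumption here).
    z = {name: [] for name in names}
    for i in range(len(vocabulary)):
        column = [freqs[name][i] for name in names]
        mu, sigma = mean(column), pstdev(column)
        for name in names:
            z[name].append((freqs[name][i] - mu) / sigma if sigma > 0 else 0.0)

    # Delta is the mean absolute difference of the z-scores.
    deltas = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            deltas[(a, b)] = mean(abs(x - y) for x, y in zip(z[a], z[b]))
    return deltas


if __name__ == "__main__":
    toy = {
        "text_a": "the cat and the dog".split(),
        "text_b": "the cat and the dog".split(),
        "text_c": "a dog a dog a the".split(),
    }
    for pair, delta in burrows_delta(toy, 3).items():
        print(pair, round(delta, 3))
```

A smaller Delta means that two texts use the most frequent words in more similar proportions. In practice the texts would be whole novels or all the utterances of a character rather than toy strings.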
In our research Delta was used as one of two approaches to studying character voices in Leo Tolstoy's War and Peace. As in the case of Sienkiewicz (Rybicki, 2006), there is a critical opinion (Eikhenbaum, 2009) that Tolstoy's characters are quite distinct from each other in their speech. Our own experience of closely reading the speech instances extracted from War and Peace (for details on the extraction procedure see Skorinkin, Bonch-Osmolovskaya, 2015) supports this opinion. It therefore seemed natural to test computational methods that had already proved applicable to precisely this task. We used the R package stylo (Eder et al., 2013).

Testing the method on Russian material

Surprisingly enough, we were unable to find any work that applied Delta to Russian material. We therefore felt obliged to conduct a couple of experiments to test its general applicability to Russian before proceeding to character speech. At the first stage we tested Delta's ability to distinguish between Tolstoy and Dostoevsky. The training set contained one of the six parts of Dostoevsky's Crime and Punishment and three of the fifteen books of Tolstoy's War and Peace. The remaining 18 pieces of text (5 by Dostoevsky and 13 by Tolstoy) constituted the test set.
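The train/test setup described above can be sketched roughly as follows: each test text is assigned to the author of the nearest training text under a Delta-like distance, and accuracy is the share of correct assignments. This is a simplified illustration, not the procedure implemented in stylo. The helper names (`char_ngrams`, `attribute`, `accuracy`), the input format and the choice to take the feature list and the z-score parameters from the training set only are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def char_ngrams(text, n):
    """Split a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def attribute(train, test, tokenize, n_features):
    """Assign each test text to the author of its nearest training text.

    train: {text_name: (author, raw_text)}; test: {text_name: raw_text}.
    tokenize turns raw text into a list of items (words or n-grams).
    """
    train_items = {name: tokenize(text) for name, (_, text) in train.items()}
    test_items = {name: tokenize(text) for name, text in test.items()}

    # Feature list: the most frequent items of the training set (an assumption).
    totals = Counter()
    for items in train_items.values():
        totals.update(items)
    features = [f for f, _ in totals.most_common(n_features)]

    def profile(items):
        counts = Counter(items)
        return [counts[f] / max(len(items), 1) for f in features]

    train_profiles = {name: profile(items) for name, items in train_items.items()}

    # Means and standard deviations are estimated on the training profiles.
    columns = list(zip(*train_profiles.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]

    def zscores(prof):
        return [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(prof, mus, sigmas)]

    train_z = {name: zscores(prof) for name, prof in train_profiles.items()}

    predictions = {}
    for name, items in test_items.items():
        z = zscores(profile(items))
        nearest = min(
            train_z,
            key=lambda t: mean(abs(a - b) for a, b in zip(z, train_z[t])),
        )
        predictions[name] = train[nearest][0]
    return predictions


def accuracy(predictions, gold):
    """Share of test texts whose predicted author matches the gold label."""
    hits = sum(predictions[name] == gold[name] for name in gold)
    return hits / len(gold)
```

Passing `str.split` as `tokenize` gives word features, while `lambda t: char_ngrams(t, 4)` gives character 4-grams, so the same sketch covers both kinds of settings.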
The results with different settings can be seen in Table 1 and Figures 1 and 2:

┌───────────────┬──────────┬─────────┬─────────┬──────────┐
│N most frequent│Words     │grams    │grams    │grams     │
├───────────────┼──────────┼─────────┼─────────┼──────────┤
│25             │80% (4/5) │60% (3/5)│60% (3/5)│100% (5/5)│
│30             │80% (4/5) │80% (4/5)│60% (3/5)│80% (4/5) │
│35             │80% (4/5) │60% (3/5)│60% (3/5)│80% (4/5) │
│40             │80% (4/5) │60% (3/5)│60% (3/5)│80% (4/5) │
│45             │80% (4/5) │60% (3/5)│80% (4/5)│100% (5/5)│
│50             │80% (4/5) │60% (3/5)│80% (4/5)│100% (5/5)│
│55             │80% (4/5) │60% (3/5)│80% (4/5)│80% (4/5) │
│60             │100% (5/5)│60% (3/5)│80% (4/5)│80% (4/5) │
│65             │100% (5/5)│60% (3/5)│80% (4/5)│80% (4/5) │
│70             │80% (4/5) │60% (3/5)│80% (4/5)│100% (5/5)│
│75             │80% (4/5) │80% (4/5)│80% (4/5)│80% (4/5) │
│80             │80% (4/5) │60% (3/5)│80% (4/5)│100% (5/5)│
│85             │80% (4/5) │80% (4/5)│80% (4/5)│100% (5/5)│
│90             │100% (5/5)│80% (4/5)│80% (4/5)│100% (5/5)│
│95             │80% (4/5) │80% (4/5)│80% (4/5)│100% (5/5)│
│100            │80% (4/5) │80% (4/5)│80% (4/5)│100% (5/5)│
└───────────────┴──────────┴─────────┴─────────┴──────────┘

Table 1.
Delta authorship attribution, Tolstoy vs Dostoevsky

[Figure: Tolstoy vs Dostoevsky, Principal Components Analysis; horizontal axis PC1 (40.9%)]

Fig. 1. Delta PCA on 150 most frequent character 4-grams, Tolstoy vs Dostoevsky

Fig. 2. Delta PCA on 100 most frequent words, Tolstoy vs Dostoevsky

The second experiment involved four Russian authors: Tolstoy, Dostoevsky, Goncharov and Turgenev. All four represent (roughly) the same epoch of Russian literature, and all four are recognized as masters of realistic prose. We used three novels by each author for our experiment. At the first stage two out of each three were placed in the training corpus, and Delta had to attribute the remaining one. All four novels from the test corpus were attributed correctly. At the second stage we reversed the setup and left only one novel by each author in the training set.
In this case Delta consistently produced 7 out of 8 correct attributions. The only mistake was Tolstoy's Family Happiness, incorrectly attributed to Dostoevsky. A possible explanation is that Family Happiness is written in the first person from the point of view of a young woman, something uncommon for Tolstoy, and the only work by Dostoevsky in the training corpus was The Insulted and Humiliated, also a first-person narrative. Fig. 3 shows Delta scores for all the novels, visualized with the help of principal component analysis.

[Figure: XIX century novels, Principal Components Analysis]

Fig. 10. PCA for top 15 most talkative characters in War and Peace, alternative features

Fig. 11.
Hierarchical clustering for top 15 most talkative characters in War and Peace, alternative features

Note that here we do not see any similarity between Andrey and Pierre. Moreover, Andrey is close to Napoleon, which is rather striking given that Napoleon is his hero and role model for a considerable part of the novel.

The separation of Vera, on the other hand, is still clearly visible: she is far from the Moscow-centered Rostov world and close to the Saint Petersburg world of the Kuragin family and Berg.
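For readers unfamiliar with hierarchical clustering, the way such a tree of characters is built can be sketched as below. This is a minimal agglomerative procedure with average linkage, written for illustration only. The linkage criterion actually used for Fig. 11 is not stated in the text, and the function name, the labels and the toy distances in the usage note are made up.

```python
from itertools import combinations


def average_linkage(names, dist):
    """Agglomerative clustering with average linkage.

    names: list of item labels; dist: function (a, b) -> distance.
    Returns the merges in the order they happen, as
    (cluster_a, cluster_b, average_distance) tuples.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            # Average pairwise distance between members of the two clusters.
            d = sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
            if best is None or d < best[0]:
                best = (d, c1, c2)
        d, c1, c2 = best
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(c1 + c2)
        merges.append((c1, c2, d))
    return merges
```

As a usage example with invented values, placing four items at positions 0, 1, 10 and 12 on a line and using the absolute difference as the distance merges the first two items first, then the last two, and finally joins the two pairs. Items that merge late, at a large distance, are the ones that appear as separate branches in a dendrogram.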