As digital literary studies advance, they require more specific metadata about literary phenomena: narrator (Hoover 2004), characters (Kastorp et al. 2015), place and period, and so on. This metadata can be used to explain results in tasks like authorship attribution or genre detection, or to evaluate digital methods (Calvo Tello 2017). What is the most efficient way to start annotating this information in corpora of thousands of texts in languages, genres and historical periods for which many NLP tools have not been trained? In this proposal, the aim is to identify specific literary metadata about entire texts with methods that are either language-independent or easily adaptable by humanists.
Two approaches to classifying unlabeled samples are applied here: rule-based classification and supervised machine learning. In rule-based classification (Witten et al. 2011), domain experts define formalised rules intended to classify the samples correctly. For example, a rule based on a single token per class value can predict whether a text is written in the third person (83% of the corpus) or in the first person. The tokens for the two values are the Spanish words dije ('I said') and dijo ('he said'), and the rule:
The results of applying this rule can be presented as a confusion matrix:
Fig 1. Confusion Matrix of rule-based results about narrator
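The following is a minimal sketch of how a single-token rule of this kind and its confusion matrix could be computed. It is not the rule used in this proposal: the threshold `FIRST_PERSON_RATIO`, the function names and the three toy sentences are assumptions made only for illustration.

```python
import re
from collections import Counter

# Hypothetical threshold; the value used in the proposal is not stated here.
FIRST_PERSON_RATIO = 0.5


def count_token(text, token):
    """Count whole-word, case-insensitive occurrences of token in text."""
    return len(re.findall(r"\b" + re.escape(token) + r"\b", text.lower()))


def predict_narrator(text, ratio=FIRST_PERSON_RATIO):
    """Predict 'first' if 'dije' dominates over 'dijo', otherwise 'third'."""
    dije = count_token(text, "dije")
    dijo = count_token(text, "dijo")
    total = dije + dijo
    if total == 0:
        return "third"  # fall back to the default (most populated) class
    return "first" if dije / total > ratio else "third"


def confusion_matrix(gold, predicted, labels=("third", "first")):
    """Rows are gold labels, columns are predicted labels."""
    pairs = Counter(zip(gold, predicted))
    return [[pairs[(g, p)] for p in labels] for g in labels]


# Invented toy texts, not taken from the corpus.
texts = [
    "Me miró y dije que no. Luego dije adiós.",
    "Ella dijo que sí y él dijo que no.",
    "No dijo nada, pero yo dije la verdad y dije más.",
]
gold = ["first", "third", "first"]
predicted = [predict_narrator(t) for t in texts]
matrix = confusion_matrix(gold, predicted)
print(matrix)
```

Matching whole words keeps forms such as dijeron from being counted as dije.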
For supervised methods (Müller and Guido 2016; VanderPlas 2016), we need labeled samples to train and evaluate the method. In the following table, the different classifiers and document representations achieve different scores:
Fig 2. Accuracy (F1-score) for narrator
The data is part of the Corpus of Spanish Novels of the Silver Age (1880-1939) (used in Calvo Tello et al. 2017), with 350 novels in XML-TEI by 58 authors. Each text has been annotated manually with metadata, and each annotation has been assigned a degree of certainty. The 262 texts with either high or medium certainty have been used to create a gold standard with the following classes:
The scripts have been written in Python (available on GitHub).
Different classification algorithms (with 10-fold cross-validation) and numbers of most frequent words have been evaluated. For the rule-based approach, a single token was used to represent each class value, and a ratio was assigned for the default class value (see the GitHub repository for the rules). Both approaches were compared to a “most populated class” baseline, which is quite high in many cases.
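As an illustration of one possible document representation, the sketch below turns texts into relative frequencies of the most frequent words and standardises them as z-scores. It is a simplified stand-in and not the code from the repository: the tokenizer, the function names, the use of the population standard deviation and the toy documents are all assumptions.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Very simple tokenizer (an assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def most_frequent_words(docs_tokens, n):
    """Return the n most frequent words over the whole collection."""
    totals = Counter()
    for tokens in docs_tokens:
        totals.update(tokens)
    return [word for word, _ in totals.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one document."""
    counts = Counter(tokens)
    size = len(tokens) or 1  # avoid division by zero for empty documents
    return [counts[word] / size for word in vocabulary]


def z_scores(matrix):
    """Standardise each column (word) to zero mean and unit deviation."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stdevs)]
        for row in matrix
    ]


# Invented toy documents, not taken from the corpus.
docs = ["dijo que dijo", "dije que no", "que que dijo"]
docs_tokens = [tokenize(d) for d in docs]
vocabulary = most_frequent_words(docs_tokens, 2)
features = z_scores([relative_frequencies(t, vocabulary) for t in docs_tokens])
```

The resulting rows could then be passed to any classifier and evaluated with cross-validation; varying `n` corresponds to varying the number of most frequent words.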
The results of both approaches are as follows:
Fig 3. Results
In many cases the baselines are higher than the results of both approaches. The rule outperformed the baseline in the case of the name of the setting, with very good results. In two cases (narrator and type of setting), machine learning is the most successful approach and its F1 is significantly higher than the baseline (one-sample t-test, α = 5%). The algorithms Support Vector Machines, Logistic Regression and Random Forest are the most successful, while tf-idf and especially z-scores yielded the best results; the latter is a data representation “highly uncommon in other applications” outside stylometry (Kestemont et al. 2016).
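The comparison against the baseline can be sketched as follows, assuming that the one-sample t-test is applied to the per-fold scores of the 10-fold cross-validation and that the baseline is the share of the most populated class. Both are assumptions about the procedure, and the fold scores below are invented toy values, not results from this study.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev

# Critical t value for 9 degrees of freedom (10 folds), one-sided, alpha = 0.05.
T_CRITICAL_DF9_ONE_SIDED = 1.833


def majority_baseline(labels):
    """Share of samples that belong to the most populated class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def one_sample_t(scores, reference):
    """t statistic for H0: mean(scores) == reference, with len(scores) - 1 df."""
    n = len(scores)
    if n < 2:
        raise ValueError("at least two scores are needed")
    return (mean(scores) - reference) / (stdev(scores) / sqrt(n))


# Toy gold labels reproducing the 83% share of third-person narrators.
labels = ["third"] * 83 + ["first"] * 17
baseline = majority_baseline(labels)

# Invented per-fold scores for illustration only.
fold_scores = [0.90, 0.88, 0.92, 0.85, 0.91, 0.89, 0.93, 0.87, 0.90, 0.86]
t_value = one_sample_t(fold_scores, baseline)
is_significant = t_value > T_CRITICAL_DF9_ONE_SIDED
```

A high baseline such as this one leaves little room for improvement, which is why a classifier must be consistently better across folds for the difference to count as significant.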
In this proposal I have used simple rules and simple features to detect relatively complex literary metadata, in many cases with high baselines. While machine learning showed a statistically significant improvement in detection for two classes (type of setting and narrator), rules worked better for the name of the setting. This is a promising starting point for further research aimed at annotating the rest of the corpus.