This poster presents the textbox
The CLiGS textbox is dedicated to making collections of literary texts in Romance languages freely available. It currently contains novels, novellas and short stories published between 1830 and 1940 in France
The individual text collections were created with various usage scenarios in mind, and each collection has been compiled in a slightly different manner. For example, the two collections of Spanish novels, the Corpus of Spanish Novels (1880-1940) and the Collection of 19th century Spanish-American Novels (1880-1916), have been prepared for authorship attribution. Accordingly, the two collections have been balanced with regard to the number of texts per author. The poster will give an overview of the sub-collections of the textbox and of the principles guiding their compilation.
Independently of their original source format (e.g. HTML or EPUB), the texts are converted (with Python scripts or XSLT) to a common TEI schema established by the CLiGS group.
Moreover, the collections of French, Spanish, Spanish-American, and Portuguese novels as well as the Italian short stories are made available in a version combining basic structural markup (chapter and sentence divisions) with token-level linguistic annotation, including lemma, part-of-speech, morphology, and basic semantic annotation using FreeLing (cf. Padró and Stanilovsky, 2012) and WordNet (see Figure 1). Finally, the collection of French plays is available not only in XML-TEI, but also in the “Zwischenformat” developed by the DLINA group (Kampkaspar et al., 2015).
Figure 1: Linguistic annotations in an XML format that departs minimally from the TEI standard in order to allow multiple token-level annotations
Besides administrative metadata such as license and responsibility, the collections focus on descriptive metadata. There are four main areas about which information is documented: metadata concerning the authorship (VIAF, name, country, gender), metadata concerning the literary work and editions (VIAF or other identifier, extent of the texts, print and digital source), and finally metadata concerning the genre. Since the main focus of the project is literary genre, a considerable part of the metadata is directly connected to it. Any reference to genre in the title of the work is collected as a genre label. In addition, a hierarchical system is used, comprising supergenre (e.g. “narrative” or “drama”), genre (e.g. “novel” or “novella”), subgenre (the subtype of the novel, for example “adventure novel” or “political novel”) and subsubgenre (optional, used for further differentiations like “war novel”).
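The genre hierarchy described above can be pictured as a small record per text. The following Python sketch is only an illustration: the class name, the field names and the example values are assumptions made here and do not reproduce the actual CLiGS metadata schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenreInfo:
    """Hierarchical genre description for one text (hypothetical field names)."""
    supergenre: str                      # e.g. "narrative" or "drama"
    genre: str                           # e.g. "novel" or "novella"
    subgenre: Optional[str] = None       # e.g. "adventure novel"
    subsubgenre: Optional[str] = None    # optional finer distinction
    title_label: Optional[str] = None    # genre term found in the work's title, if any

    def path(self) -> str:
        """Return the filled levels from most general to most specific."""
        levels = [self.supergenre, self.genre, self.subgenre, self.subsubgenre]
        return " > ".join(level for level in levels if level)


# Invented example record, not taken from the collections.
example = GenreInfo("narrative", "novel", subgenre="adventure novel")
print(example.path())
```

Keeping the optional levels as empty values rather than omitting them makes it straightforward to filter a collection at any level of the hierarchy.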
There are many possible use cases for the textbox collections. The poster will demonstrate some results from the areas of authorship attribution (using the stylo package for R; Eder et al., 2016), network analysis (using NetworkX in Python), and topic modeling (using MALLET with “tmw” for Python). These scenarios are intended not only as examples of analyses conducted within the CLiGS group, but also as suggestions for potential users of the CLiGS textbox. Figures 2 and 3 show some results for authorship attribution and network analysis.
Figure 2: Authorship attribution, results of Cosine Delta on the Corpus of Spanish Novels (cf. Smith and Aldridge, 2011; Evert et al., 2017)
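For readers unfamiliar with the measure, Cosine Delta is commonly described as the cosine distance between vectors of z-scored word frequencies. The sketch below illustrates that idea in plain Python. It is not the stylo implementation used for the poster, and the frequency table is invented toy data (three texts, four frequent words).

```python
import math
from statistics import mean, pstdev


def zscores(freqs):
    """Standardize each word's relative frequencies across all texts (column-wise)."""
    columns = list(zip(*freqs))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [
        [(value - mu) / sigma if sigma else 0.0
         for value, mu, sigma in zip(row, mus, sigmas)]
        for row in freqs
    ]


def cosine_delta(u, v):
    """Cosine distance (1 - cosine similarity) between two z-score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Invented relative frequencies: rows are texts, columns are frequent words.
freqs = [
    [0.050, 0.030, 0.010, 0.020],  # text A
    [0.052, 0.029, 0.011, 0.019],  # text B, constructed to resemble A
    [0.030, 0.045, 0.020, 0.010],  # text C, constructed to differ from A
]
z = zscores(freqs)
delta_ab = cosine_delta(z[0], z[1])
delta_ac = cosine_delta(z[0], z[2])
print(delta_ab < delta_ac)
```

In an attribution setting, a text of unknown authorship would be assigned to the candidate whose texts yield the smallest distance.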
Figure 3: Character network for Jean Racine's tragedy Britannicus (1669), based on the number of words spoken in mutual presence (represented by the thickness of the lines)
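One way to derive such edge weights is sketched below in plain Python. It rests on an assumption made here, namely that the weight of a pair is the total number of words the two characters speak in scenes where both are on stage. The exact definition used for the figure is not given in this abstract, and the scene data and character names are invented.

```python
from collections import defaultdict
from itertools import combinations

# Toy input (invented): one dict per scene, mapping each character on stage
# to the number of words that character speaks in the scene.
scenes = [
    {"A": 120, "B": 80},
    {"A": 40, "B": 10, "C": 60},
    {"C": 200},  # a monologue contributes to no pair
]


def mutual_presence_weights(scenes):
    """Sum, per character pair, the words both speak in scenes they share."""
    weights = defaultdict(int)
    for scene in scenes:
        for a, b in combinations(sorted(scene), 2):
            weights[(a, b)] += scene[a] + scene[b]
    return dict(weights)


weights = mutual_presence_weights(scenes)
for (a, b), w in sorted(weights.items()):
    print(a, b, w)
```

The resulting pairs could then be loaded into a NetworkX graph as weighted edges, for instance with `Graph.add_weighted_edges_from`, and the weight mapped to line thickness when drawing.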