Converted from an OASIS Open Document
Although there have been some infrastructural developments of late, the main
modus operandi in digital literary studies is still to apply a certain research method to an ephemeral corpus. In a best-case scenario, the results are
somehow reproducible, in a worst-case scenario they are not reproducible at all. At best, there is an openly accessible corpus in a standard format such as TEI, another markup language, or at least TXT. At worst, the corpus is not even accessible, i.e., the research results cannot be questioned.
However, there are signs that this is slowly changing. Some projects provide interfaces that allow for multiple ways of access to corpora. One of these projects is DraCor, an open platform for research on (European) drama, which will be introduced in this paper (accessible at
Similar to the COST Action on European novels (Schöch et al. 2018), the DraCor project seeks to establish a bundle of multilingual drama corpora encoded in basic TEI as basis for digital comparative studies. To date, the platform enables access to a Russian-language (
The advantages of a freely available corpus hosted on GitHub are obvious. Not only can the corpus be cloned and loaded directly into an XML database like eXist-db. Using the SVN wrapper from GitHub, the entire corpus can also be downloaded directly, in its current state and without version history if this is not needed:
svn export https://github.com/dracor-org/rusdracor/trunk/tei
An openly accessible GitHub repository also means that pull requests for error correction are possible and welcome.
DraCor relies on eXist as XML database to process TEI files and to provide functions for researching the corpora. The frontend is built with React (
To come close to the ideal and the possibility of applying "all methods to all texts" in a simple manner (Frank/Ivanovic 2018), it takes more than open corpora. The article by Frank/Ivanovic advocates SPARQL endpoints (for which there is also a readily-available app for eXist-db:
DraCoroffers such endpoint, but also features a rich general API documented and explained via Swagger (
A simple use-case scenario would look like this: using RStudio you can throw a quick glance into a corpus with just a few lines of code, maybe regarding the development of the number of characters in Russian drama between 1740 and 1940, stored in the metadata table (
This very simple example is able to show the starting point of a decisive structural diversification of the Russian drama landscape. Pushkin's historical drama "Boris Godunov" (1825), result of his reading of Shakespeare, features speech acts of 79 characters, a number previously unthinkable in Russia drama.
However, the possibilities are not limited to using ready-made API functions. New research ideas always create new needs for easily obtainable and reproducible data and metrics; the API can be extended accordingly, i.e., new research ideas can be implemented centrally in the API layer. This is made even easier by the fact that Apache Ant can be used to rebuild the entire development environment on your own system.
In addition to structural data and metadata, full texts without markup can also be obtained, e.g., if methods such as stylometry or topic modelling are the purpose, i.e., methods that need a "bag of words" and do not require markup.
All in all, the structure and documentation of open APIs makes it much easier to reproduce research results, which up to now has often been a time-consuming (or impossible) process.
An example of the versatility of the DraCor API is the Shiny App created by Ivan Pozdniakov (
The formalisation of literary texts, for example via markup, is not self-explanatory. Although the community can rely on some standards, every operationalisation depends on the actual research question. To give an example: if you would like to extract network data based on character interactions in a literary text, you would have dozens of different ways of doing this (e.g., Grayson et al. 2016 test different extraction methods for novels and compare the results). This also applies to plays. In order to sharpen the senses for this in teaching, we developed the tool "ezlinavis" (
In addition to an approach to the gamification of the process of correcting TEI-encoded corpora (Göbel/Meiners 2016), we also developed a card game for teaching purposes in order to playfully train the understanding of network values (Fischer at al. 2018).
These didactic tools wrapped around the platform are an integral part of the whole project as they are based on the project data and operationalisations. While building the platform, it was important to recognise that data can take several forms and be equally important for research and teaching.
The TEI files contain GND [Integrated Authority File of the German National Library] and Wikidata identifiers for both authors and works. In this way, various data and facts that lie beyond one's own corpus work can be included. Something like an automatically created gallery of authors has a more illustrative character (de la Iglesia/Fischer 2016). But using the same identifiers, we can also determine if a corpus has a regional bias. Via Wikidata, we can easily display the distribution of the authors' places of birth and death on a map (by doing so, we could rule out that our German-language corpus GerDraCor has a regional bias, cf. Göbel/Fischer 2015).
Similarly, the Wikidata ID of the plays can be used to find out where they were first performed (example query:
Projects like DraCor seek to provide the digital literary studies with a reliable and extensible infrastructure so that the research community can focus on research questions.
An important conclusion for us was that we would give up the further development of our all-in-one Python script collection "dramavis" (Kittel/Fischer 2014–2018 and Fischer et al. 2017), which we have been developing for four years now. From here on, we would rather devote our time to the API. "Dramavis" followed the idea of rapid prototyping and had to do all by itself, including the preprocessing of data (Trilcke/Fischer 2018), which is not untypical in the Digital Humanities. The code base has grown quite a bit in the meantime and its maintenance has become difficult and often enough led away from actual research questions.
In allusion to the project "ProgrammableWeb" – which maintains a database of open APIs and whose slogan is: "APIs, Mashups and the Web as Platform" (accessible at
Programmable Corpora facilitate the implemention of research questions around corpora. It is to be expected that infrastructural efforts of this kind will pay off for the entire community with effects such as those listed by John Womersley in his presentation at the ICRI2018 conference in Vienna: a) dramatically increase scientific reach; b) address research questions of long duration requiring pooled effort; c) promote collaboration, interdisciplinarity, interaction.
The are numerous ways to connect to Programmable Corpora, no matter if you don't want to code at all and only need a CSV file for Excel or LibreOffice Calc or a GEXF file for Gephi, if you want to research a corpus via its connections to the Linked Open Data cloud or just want to get specific data for your R or Python script without having to worry about the corpus and its maintenance and reproducibility (all this remains an option, though). Programmable Corpora make it easier to decide on which level of the platform your own research process starts.