The COST Action Distant Reading for European Literary History is a collaborative, interdisciplinary network which aims “to facilitate the creation of a broader, more inclusive and better-grounded account of European literary history and cultural identity” (
The Distant Reading Action aims to “develop the resources and methods necessary to change the way European literary history is written” (“Memorandum of Understanding”). To that end, the Action will create a diachronic, multilingual, medium-sized, open-access benchmark corpus of novels published between 1840 and 1919, called ELTeC (European Literary Text Collection). A working group dedicated to ‘Scholarly Resources’ is collecting sets of 100 novels in at least 10 European languages. These novels are encoded according to the recommendations of the TEI (Text Encoding Initiative; see TEI Consortium 2018 and Burnard 2014) and made freely available via GitHub and Zenodo (see
Distant Reading network on evaluating and improving methods and tools supporting text annotation and analysis for Digital Literary Studies. It also provides the materials for work on theoretical concerns and use cases from European literary history, which likewise takes place in the Distant Reading network. Hence, ELTeC is a key activity in the Distant Reading network.
The key challenge for the corpus design and annotation schemas of ELTeC is the tension between criteria that are valid and meaningful and an operationalisation of those criteria that works for texts from many different cultures, a challenge typical of large-scale DH projects. The corpus design is therefore metadata-based: it defines a set of parameters instead of relying on literary canons. As a result, we integrate famous, canonical novels as well as forgotten, non-canonical ones. We represent the variation in production and aim to maximize the variety within each time period. In this respect, the design of ELTeC is similar to that of reference corpora designed to serve as an empirical base for different research approaches. We have put a particular focus on the composition of the corpus, which will be balanced with regard to parameters that include language, publication date, author gender, length and reprint counts. The corpus also includes ‘smaller’ varieties of a language so that the high variance of the vernacular languages across Europe can be taken into account.
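To illustrate what a metadata-based balance check could look like, the following minimal Python sketch computes the share of novels per time period and per author gender. The records, the field names (`year`, `author_gender`, `reprints`) and the 20-year binning of the 1840-1919 span are assumptions made for this illustration only; they are not taken from the ELTeC specification.

```python
from collections import Counter

# Hypothetical metadata records: titles, field names and values are
# illustrative assumptions, not actual ELTeC entries.
NOVELS = [
    {"title": "Novel A", "year": 1845, "author_gender": "F", "reprints": 0},
    {"title": "Novel B", "year": 1862, "author_gender": "M", "reprints": 3},
    {"title": "Novel C", "year": 1881, "author_gender": "F", "reprints": 1},
    {"title": "Novel D", "year": 1903, "author_gender": "M", "reprints": 0},
]


def period_of(year, start=1840, end=1919, width=20):
    """Map a publication year to a time slice such as '1840-1859'.

    The slice width of 20 years is an assumption for this sketch.
    """
    if not start <= year <= end:
        raise ValueError(f"{year} lies outside the corpus span {start}-{end}")
    lower = start + ((year - start) // width) * width
    return f"{lower}-{min(lower + width - 1, end)}"


def shares(records, key):
    """Return the relative frequency of each value produced by `key`."""
    counts = Counter(key(record) for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


if __name__ == "__main__":
    # Shares per time period and per author gender for the sample records.
    print(shares(NOVELS, lambda r: period_of(r["year"])))
    print(shares(NOVELS, lambda r: r["author_gender"]))
```

Comparing such shares against target proportions for each parameter is one simple way to see where a language collection is still unbalanced.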
ELTeC is encoded in TEI XML, customised for Distant Reading methods and tools. “Our main goal has been to identify a small core set of textual features which can be readily (preferably automatically) identified in existing digital transcriptions, or easily and consistently provided by new transcriptions” (Section ‘Principles’ of the white paper “Encoding Guidelines for the ELTeC”). Unlike many Digital Scholarly Editing projects, the intention is not to support a rich representation of the full complexities of an original source, but rather to represent, in a consistent and economical way, an agreed minimum of the textual features most relevant to Distant Reading practices. We distinguish a basic encoding (level 0), a richer TEI encoding (level 1) and a linguistic encoding (level 2). A single TEI ODD is defined, from which we derive schemas and documentation for each level of annotation.
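As a rough illustration of why a small, consistent set of encoded features suits automated processing, the sketch below reads a minimal TEI document with Python's standard library and counts chapters, paragraphs and words. The sample document is hypothetical and uses only generic TEI elements; it is not an ELTeC file and does not reproduce the actual ELTeC schema at any level.

```python
import xml.etree.ElementTree as ET

# The TEI namespace, bound to a prefix for use in path expressions.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, minimal TEI document (illustrative only, not an ELTeC file).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Hypothetical Novel</title></titleStmt>
      <publicationStmt><p>Illustrative sample only.</p></publicationStmt>
      <sourceDesc><p>No real source.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter">
        <head>Chapter 1</head>
        <p>It was a quiet morning.</p>
        <p>Nothing moved in the street.</p>
      </div>
    </body>
  </text>
</TEI>"""


def summarize(xml_text):
    """Return title, chapter count, paragraph count and word count."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    chapters = root.findall(".//tei:body/tei:div", TEI_NS)
    # Only paragraphs inside <body> count as novel text; header <p> are skipped.
    paragraphs = root.findall(".//tei:body//tei:p", TEI_NS)
    words = sum(len("".join(p.itertext()).split()) for p in paragraphs)
    return {
        "title": title,
        "chapters": len(chapters),
        "paragraphs": len(paragraphs),
        "words": words,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Because every text marks the same few features in the same way, a function like this can be applied unchanged across all language collections, which is the practical benefit of agreeing on a minimal encoding.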
We integrate existing textual resources, whether already TEI-encoded or not, e.g. from Project Gutenberg (
Collaborative corpus creation, documentation and publication of ELTeC take place in our GitHub organisation (COST-ELTeC; see:
Distant Reading for European Literary History (CA16204) is a COST Action funded by the COST Association through the Horizon 2020 Framework Programme of the EU.