The proposed paper presents the development of Oralia diacrónica del español (ODE), a project funded by MINECO/AEI/FEDER, UE (reference: FFI2017-83400-P).
The digital resource consists of a diachronic corpus of Spanish texts from the south of Spain (mainly from the old Kingdom of Granada, which comprised the current provinces of Granada, Málaga and Almería) written between 1492 and 1833. These texts, characterised by communicative immediacy or conceptional orality (Koch and Oesterreicher, 2007), include inventories of goods, witnesses’ testimonies in criminal trials and surgeons’ reports on the state of an injured or dead person, in which doctors and surgeons use both colloquialisms and learned words. Furthermore, the texts come from different archives in the south of Spain, which makes the corpus an excellent source for studies in historical dialectology.
The corpus follows the successful model of the ERC-funded project Post Scriptum: A Digital Archive of Ordinary Writing, based on TEITOK. This model combines two methodological approaches, which correspond to two successive stages in the creation of the corpus (Vaamonde, 2015; Carvalheiro, 2016: 177-78):
1) A philological approach that involves the digital edition of the manuscripts (selection of documents and metadata, transcription based on TEI-XML). The texts have been encoded following the TEI P5 Guidelines. Furthermore, as proposed by CHARTA, the texts in the corpus can be visualised in three different formats: images of the manuscripts, diplomatic transcriptions and critical editions. Each text is presented with metadata, such as date, place and text type.
2) A corpus linguistics approach, in which texts are tokenized, normalized and annotated with part-of-speech (PoS) tags (Janssen, 2014). The tagset is an adapted version of EAGLES (Expert Advisory Group on Language Engineering Standards), the international standard for European languages. NeoTag, a PoS tagger (Janssen, 2012), has been trained on another corpus of early modern Spanish, Post Scriptum. Once a considerable amount of data has been annotated and manually corrected in ODE, this material will serve as the training corpus for automatically annotating new texts, thereby improving the accuracy of the PoS tagger.
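The way the two stages can share a single edition may be sketched as follows. This is only a minimal illustration, not the ODE or TEITOK schema: the element name `tok`, the attributes `nform`, `lemma` and `pos`, the sample words and the tag values are all assumptions made for the example. Each token keeps the diplomatic spelling as its text and carries the normalized form and the linguistic annotation as attributes, so that different views can be derived from the same file.

```python
import xml.etree.ElementTree as ET

# Invented sample: element and attribute names, words and tags are
# illustrative assumptions, not taken from the ODE corpus.
SAMPLE = (
    "<s>"
    '<tok nform="una" lemma="uno" pos="DI0FS0">vna</tok>'
    '<tok nform="sábana" lemma="sábana" pos="NCFS000">sauana</tok>'
    '<tok nform="vieja" lemma="viejo" pos="AQ0FS0">bieja</tok>'
    "</s>"
)


def render(xml_text, view):
    """Return the sentence as a string in the requested view."""
    root = ET.fromstring(xml_text)
    words = []
    for tok in root.iter("tok"):
        diplomatic = tok.text or ""
        if view == "diplomatic":
            words.append(diplomatic)
        elif view == "normalized":
            # Fall back to the diplomatic form when no normalization exists.
            words.append(tok.get("nform", diplomatic))
        else:
            raise ValueError(f"unknown view: {view}")
    return " ".join(words)


diplomatic_view = render(SAMPLE, "diplomatic")
normalized_view = render(SAMPLE, "normalized")
```

Because both views are computed from the same tokens, a correction made once (for instance to a lemma) is available to every visualisation and query without maintaining parallel files.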
Thanks to the user-friendly interface offered by TEITOK, the following information can be revised and modified online: TEI tags, metadata, lemmas and PoS. Encoded information can be retrieved and visualised in different ways, such as KWIC concordances, indexes and maps (displayed on OpenStreetMap). The corpus can be searched and browsed with different filters (including text typology, place, date and archive), which can be applied simultaneously and combined with linguistic filters such as lemma and PoS. A user-friendly query builder allows a general audience with no background in computational linguistics to explore the corpus (Janssen et al., 2017).
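The combination of metadata filters with lemma and PoS filters in a KWIC search can be illustrated with a small sketch. It does not reproduce the TEITOK query engine; the classes `Token` and `Document`, the function `kwic`, the sample documents and the shortened tags are hypothetical and serve only to show the principle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str   # spelling as transcribed
    lemma: str
    pos: str    # shortened, illustrative tag


@dataclass
class Document:
    text_type: str
    place: str
    year: int
    tokens: List[Token]


def kwic(docs, lemma, pos_prefix="", text_type=None, place=None, width=2):
    """Return (left context, keyword, right context) for each match."""
    hits = []
    for doc in docs:
        # Metadata filters: a filter left as None is not applied.
        if text_type is not None and doc.text_type != text_type:
            continue
        if place is not None and doc.place != place:
            continue
        for i, tok in enumerate(doc.tokens):
            # Linguistic filters: lemma and (optionally) PoS prefix.
            if tok.lemma == lemma and tok.pos.startswith(pos_prefix):
                left = " ".join(t.form for t in doc.tokens[max(0, i - width):i])
                right = " ".join(t.form for t in doc.tokens[i + 1:i + 1 + width])
                hits.append((left, tok.form, right))
    return hits


# Invented documents for illustration only.
DOCS = [
    Document("inventory", "Granada", 1650, [
        Token("vna", "uno", "DI"),
        Token("sauana", "sábana", "NC"),
        Token("bieja", "viejo", "AQ"),
    ]),
    Document("testimony", "Málaga", 1720, [
        Token("la", "el", "DA"),
        Token("sabana", "sábana", "NC"),
        Token("rota", "roto", "AQ"),
    ]),
]

all_hits = kwic(DOCS, "sábana")
granada_hits = kwic(DOCS, "sábana", place="Granada")
```

Searching by lemma rather than by spelling retrieves both `sauana` and `sabana`, which is the practical benefit of lemmatising texts with variable orthography.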
Finally, we would like to emphasize that the new online corpus successfully addresses the following challenges:
a) It combines digital textual scholarship (TEI) and corpus linguistics (based on the EAGLES international standard for morphosyntactic annotation and lemmatisation).
b) It allows scholars to work on a single edition that the end user can visualise in different formats in the digital resource.
c) It permits independent management: scholars can upload and edit their work and keep control over their own research, without depending on an external person in charge of the digital resource.