An Online Corpus For The Study Of Historical Dialectology: Oralia diacrónica del español (ODE)

Calderón Campos, Miguel (University of Granada, Spain), calderon@ugr.es
Díaz Bravo, Rocío (University of Granada, Spain), rociodiazbravo@ugr.es

2019-05-10T17:43:00Z

Paper type: Long Paper
Keywords: Corpus linguistics; historical linguistics; Spanish; computational linguistics; digital textual scholarship; corpus and text analysis; Spanish and Spanish American studies; linguistics; English

The proposed paper aims to present the development of Oralia diacrónica del español (ODE corpus, University of Granada), a new digital resource for the study of historical dialectology. The resource has been built with TEITOK, “a web-based framework for corpus creation, annotation, and distribution, that combines textual and linguistic annotation within a single TEI based XML document” (Janssen, 2016). The project is funded by MINECO/AEI/FEDER, UE (reference: FFI2017-83400-P).

The digital resource consists of a diachronic corpus of Spanish texts from the south of Spain (mainly from the old Kingdom of Granada, comprising the current provinces of Granada, Málaga and Almería) written between 1492 and 1833. These texts, characterised by communicative immediacy or conceptional orality (Koch and Oesterreicher, 2007), include inventories of goods, witnesses’ testimonies in criminal trials, and surgeons’ reports on the state of an injured or dead person, in which doctors and surgeons use both colloquialisms and learned words. Because the texts come from different archives across the south of Spain, the corpus is an excellent source for studies in historical dialectology.

The corpus follows the successful model of the ERC-funded project Post Scriptum: A Digital Archive of Ordinary Writing, which is also based on TEITOK. This model combines two methodological approaches, which correspond to two successive stages in the creation of the corpus (Vaamonde, 2015; Carvalheiro, 2016: 177-78):

1) A philological approach that involves the digital edition of the manuscripts (selection of documents and metadata, transcription based on TEI-XML). The texts have been encoded following the TEI P5 Guidelines. Furthermore, as proposed by CHARTA, the texts in the corpus can be visualised in three different formats: images of the manuscripts, diplomatic transcriptions and critical editions. Each text is presented with metadata, such as date, place and text type.

2) A corpus linguistics approach, in which texts are tokenized, normalized and annotated with part-of-speech (PoS) tags (Janssen, 2014). The tagset is an adapted version of EAGLES (Expert Advisory Group on Language Engineering Standards), the international standard for European languages. NeoTag, a PoS tagger (Janssen, 2012), has been trained on another corpus of early modern Spanish: Post Scriptum. Once a considerable amount of data has been annotated and manually corrected in ODE, it will be used as the training corpus for automatically annotating new texts, thereby improving the accuracy of the PoS tagger.
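As a rough illustration of how textual and linguistic annotation can share a single document, the following Python sketch reads a small tokenized XML fragment and derives two renderings from it. The fragment is invented for illustration: the `tok` element, the `nform`, `lemma` and `pos` attribute names, and the example sentence are assumptions modelled loosely on the approach described in Janssen (2016); the tags are merely EAGLES-style, and nothing here is taken from the ODE files themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names, attribute names, tags and the
# sentence itself are assumptions made for illustration only.
SAMPLE = """<text>
  <s>
    <tok id="w-1" nform="hallé" lemma="hallar" pos="VMIS1S0">alle</tok>
    <tok id="w-2" lemma="uno" pos="DI0FS0">una</tok>
    <tok id="w-3" nform="herida" lemma="herida" pos="NCFS000">erida</tok>
  </s>
</text>"""


def read_tokens(xml_string):
    """Return one dict per <tok>, falling back to the written form
    when no normalized form is recorded."""
    root = ET.fromstring(xml_string)
    tokens = []
    for tok in root.iter("tok"):
        form = (tok.text or "").strip()
        tokens.append({
            "form": form,                     # spelling as transcribed
            "nform": tok.get("nform", form),  # normalized spelling
            "lemma": tok.get("lemma"),
            "pos": tok.get("pos"),
        })
    return tokens


def render(tokens, view):
    """Join the chosen layer ('form' or 'nform') into a running text."""
    return " ".join(t[view] for t in tokens)


tokens = read_tokens(SAMPLE)
print(render(tokens, "form"))   # alle una erida
print(render(tokens, "nform"))  # hallé una herida
```

Because both renderings are computed from the same tokens, a correction made once in the source is reflected in every view, and the lemma and PoS values stay attached to the word they describe.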

Thanks to the user-friendly interface offered by TEITOK, the following information can be revised and modified online: TEI tags, metadata, lemmas and PoS tags. Encoded information can be retrieved and visualised in different ways, such as KWIC concordances, indexes and maps (displayed on OpenStreetMap). The corpus can be searched and browsed using several filters (including text typology, place, date and archive), which can be applied simultaneously and combined with linguistic filters such as lemma and PoS. A user-friendly query builder allows a general audience with no background in computational linguistics to explore the corpus (Janssen et al., 2017).
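The combination of metadata filters with lemma and PoS filters in a KWIC search can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the query mechanism used by TEITOK: the `Doc` class, the `kwic` function, the simplified one-letter tags and the two sample documents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical document record: metadata plus annotated tokens."""
    place: str
    year: int
    text_type: str
    tokens: list  # list of (form, lemma, pos) tuples


# Invented sample data, for illustration only.
CORPUS = [
    Doc("Granada", 1650, "inventory",
        [("una", "uno", "D"), ("arca", "arca", "N"),
         ("de", "de", "S"), ("pino", "pino", "N")]),
    Doc("Málaga", 1720, "testimony",
        [("vio", "ver", "V"), ("un", "uno", "D"),
         ("arca", "arca", "N"), ("abierta", "abierto", "A")]),
]


def kwic(corpus, lemma, pos=None, years=None, width=2, **meta):
    """Return (place, year, left context, keyword, right context) hits.

    `meta` holds exact-match metadata filters (e.g. text_type="inventory");
    `years` is an optional inclusive (start, end) range.
    """
    hits = []
    for doc in corpus:
        # Metadata filters are applied together: every one must match.
        if any(getattr(doc, key) != value for key, value in meta.items()):
            continue
        if years is not None and not (years[0] <= doc.year <= years[1]):
            continue
        for i, (form, lem, tag) in enumerate(doc.tokens):
            if lem == lemma and (pos is None or tag == pos):
                left = " ".join(f for f, _, _ in doc.tokens[max(0, i - width):i])
                right = " ".join(f for f, _, _ in doc.tokens[i + 1:i + 1 + width])
                hits.append((doc.place, doc.year, left, form, right))
    return hits


for hit in kwic(CORPUS, "arca", pos="N", text_type="inventory"):
    print(hit)
```

Searching by lemma rather than by written form is what lets a single query retrieve every spelling variant of a word, which matters for texts with unstable orthography.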

Finally, we would like to emphasize that the new online corpus has successfully addressed several difficulties, as follows:

a) It combines digital textual scholarship (TEI) and corpus linguistics (based on the EAGLES international standard for morphosyntactic annotation and lemmatisation).

b) It allows working in a single edition that can be visualised in different formats by the end user in the digital resource.

c) Furthermore, it permits independent management, since scholars can upload and edit their work, having control over their own research without the need for an external person in charge of the digital resource.

Bibliography

Calderón Campos, M. and García Godoy, M. T. (coords.). Oralia diacrónica del español (ODE). Granada: Universidad de Granada. Online at: <http://corpora.ugr.es/ode/index.php?action=home&lang=en&lang=es>.

Carvalheiro, C., et al. (2016). A idade dos “desvios”: diacronia, variaçao social e linguística de corpus. In Kabatek, J. (ed.), Lingüística de corpus y lingüística histórica iberorrománica. Walter de Gruyter, pp. 175-96.

CHARTA International Network (2015). Corpus Hispánico y Americano en la Red: Textos Antiguos (CHARTA). Online at: <http://www.corpuscharta.es/>.

Janssen, M. (2012). NeoTag: a POS tagger for grammatical neologism detection. International Conference on Language Resources and Evaluation 2012: Conference Proceedings. Istanbul: ELRA, pp. 2118-24.

Janssen, M. (2014). TEITOK, a Tokenized TEI environment. Online at: <http://alfclul.clul.ul.pt/teitok/site/index.php>.

Janssen, M. (2016). TEITOK: Text-Faithful Annotated Corpora. International Conference on Language Resources and Evaluation 2016: Conference Proceedings. Paris: ELRA, pp. 4037-43.

Janssen, M., Ausensi, J. and Fontana, J. M. (2017). Improving POS Tagging in Old Spanish Using TEITOK. NoDaLiDa Workshop on Processing Historical Language 2017: Conference Proceedings. Gothenburg: NEALT, pp. 2-6.

Koch, P. and Oesterreicher, W. (2007). Lengua hablada en la Romania: español, francés, italiano. Tübingen: Max Niemeyer Verlag.

TEI Consortium (eds). TEI P5: Guidelines for Electronic Text Encoding and Interchange. Online at: <http://www.tei-c.org/Guidelines/P5/>.

Vaamonde, G. (2015). P. S. Post Scriptum: dos corpus diacrónicos de escritura cotidiana. Procesamiento del Lenguaje Natural, 55: 57-64.