<title type="main">UpCASE – A Web Application for Building and Maintaining Language Resources

Introduction

UpCASE (Upload, Correct, Annotate, Search and Export) is an open source web application Code and license can be found at . UpCASE is delivered as Maven-based Java-project, with an embedded servlet container and database. To install and run the tool, only Java and Maven are needed. that provides researchers and lay people a wide range of options for viewing, evaluating, editing and enriching text collections. It is conceived as a multifunction web application where users can upload text documents in different formats, including scanned texts whose characters are automatically recognized by Optical Character Recognition (OCR). While being able to search the uploaded text, users can correct, annotate and also export it into different formats.

Motivation and Background

UpCASE is the result of years of work with Romansh texts and lexical resources. Romansh The official denomination for Romansh according to the federal constitution of Switzerland is Rhaeto-Romanic. In this paper, the term Romansh denotes Rhaeto-Romanic as spoken in the Canton of Grisons (Liver, 2010). is the smallest of the four national languages in Switzerland with approximately 50.000 native speakers (Furer, 2005). Despite several actions by official organizations, e.g. the ratification of the European Charter for Regional and Minority Languages, Romansh is on a continuous retreat and is by now considered an endangered language. The main motivation of our work was to create a suite of tools to support the Romansh language community in building and maintaining language resources for Romansh. However, UpCASE is not restricted to be used within this particular context. For UpCASE is fully open source and can be used as API, it can be extended to be used with languages other than Romansh, e.g. by implementing extensions for language-specific issues (like different charsets, encodings, lexical resources, etc.). Our interest here is to show the relevance of using tools like UpCASE for small languages in general, since in the European Union alone there are more than 100 languages, many of them having little or no support by official institutions. http://www.eurominority.eu

We present UpCASE together with a specific historical text collection, namely the Romansh Chrestomathy (RC) compiled by Caspar Decurtins (Decurtins, 1888-1919). The RC comprises texts from four centuries reflecting the different idioms of Romansh Romansh is not just one single and consistent language. In fact, it can be subdivided into 5 main idioms, namely Sursilvan, Sutsilvan, Surmiran, Puter and Vallader, all of which have an independent orthography (Gross, 2004: 13). Alongside, in 2001 the Canton of Grisons introduced Rumantsch Grischun as official literary language meant to canopy the individual idioms. and is recognized as the most important historical text collection of the Romansh language (cf. Egloff and Mathieu, 1986: 7). It contains approximately 7500 pages covering a wide range of different topics, text types and genres, and therefore is an excellent basis for the compilation of a text corpus. All in all, the RC can be seen as a monument for language, speakers, and culture of Romansh in Switzerland, and as such constitutes an exception for small linguistic and cultural communities (Rolshoven, 2012).

The RC text corpus was created in two successive projects funded by the German Research Fund (DFG). In the first project, the RC was digitized and its characters recognized by OCR. The main objective was to correct the OCR output to provide a digital full text version of the RC. Due to the characteristics of the RC as a multilingual historical text collection of a small language, with varying orthographical standards and almost no digital lexical resources available, the correction of OCR errors could not be solved in a fully automatic manner (Rolshoven, 2012). Instead, we implemented a web-based editing tool allowing native speakers to participate in the task of OCR correction. http://www.crestomazia.ch. For comparable approaches at a larger scale see (Holley, 2009), among others. In a follow-up project the corrected texts were enriched with part-of-speech (POS) tags. As in the treatment of OCR errors, a fully automatic POS-tagging procedure could not be applied to the RC (Neuefeind, 2013). First we compiled a lexical resource by digitizing lexica and generating inflected word forms. On this basis, we approached the linguistic annotation with a semi-automatic procedure, combining lexical lookup (resulting in mostly ambiguous tags) with manual correction and supervision, thus adapting the collaborative methodology from the DRC-project (Mondaca and Atanassov, 2016).

Features and Technical Aspects

UpCASE brings together the experiences of both projects, combining the key features of collaborative corpus construction, enrichment and maintenance in a single web application. While existing tools mostly focus on a particular use case like collaborative correction (e.g. Wikisource http://www.wikisource.org ) or collaborative annotation (e.g. WebAnno https://webanno.github.io ), we connect these functionalities in a full workflow from raw to enriched text. UpCASE is based on several state-of-the-art web technologies such as Spring WebMVC, Spring Data, JAXB, and JQuery. The data is persisted in a document-oriented NoSQL database (MongoDB), whose records are structurally similar to JSON objects. The use of JSON enables a straightforward communication with other resources. Using predefined REST (Representational State Transfer) interfaces, distributed language systems may be used for data enrichment, without the need of complex data conversion. To demonstrate the possibility to interact with a remote resource, we added a customized translation feature for the particular use case presented here. We make use of the Pledari Grond (PG), an online dictionary for Romansh that we have developed in collaboration with the Lia Rumantscha ( http://www.liarumantscha.ch). A web service delivers a list of translations for a selected token by directly querying the PG. When no results are returned, the user can optionally enable a notification supplying the PG editorial staff with a request for translational support.

Using robust and scalable software on server side, and lightweight, clean and interactive components on client side, UpCASE offers different views in order to improve its usability. There are options to treat the collection as a whole, e.g. for searching, statistics and exporting, or to modify the data at hand, e.g. to edit, annotate or enrich. After importing text documents (or scanned images of texts which are OCR’ed), the text is indexed with Lucene and made accessible through an editable directory tree together with a full text search access. The stats view offers some basic statistical information about the text collection. In the export view, the user can choose different formats, e.g. plaintext or XML, to export the whole collection or parts of it. At the document level, each token is represented by a clickable widget, which opens a modal window containing different views – depending on user rights – associated with specific functions, e.g. editing, correction or annotation. The edit view allows the user to modify the text, e.g. to correct errors produced in the OCR process. The view contains both the editable word form and the relevant part of the scanned image with its highlighted position. The annotation view allows the user to create annotations like POS-tags on the fly, thus allowing complex searches on the search view.

Summary

Our presentation gives an overview of UpCASE and its basic functions, focusing on features for corpus maintenance, extension and enrichment. While in the first place we present a Romansh language resource, the concepts and features of this use case can be transposed to other text collections and languages. Thus UpCASE can be seen as an approach to technologically and methodologically support the preservation of the cultural heritage of regional and minority languages.

Bibliography Decurtins, C. (1888-1919). Rätoromanische Chrestomathie, 13, Erlangen: Junge. Reissued by Octopus-Verlag/Società Retorumantscha, Chur (1982-1986). Egloff, P. and Mathieu, J. (1986). Rätoromanische Chrestomathie: Register. Chur: Octopus-Verlag/Società Retorumantscha. Furer, J.-J. (2005). Eidgenössische Volkszählung 2000: Die aktuelle Lage des Romanischen. Neuchatel: Bundesamt für Statistik. (accessed 21 February 2016). Gross, M. (2004). Romansh: Facts & Figures. Chur: Lia Rumantscha. Holley, R. (2009). Many Hands Make Light Work: Public Collaborative OCR Text Correction in Australian Historic Newspapers. National Library of Australia. (accessed 21 February 2016). Liver, R. (2010). Rätoromanisch: Eine Einführung in das Bündnerromanische. Tübingen: Narr. Mondaca, F. and Atanassov, M. (2016). ARC: Annotierte Rätoromanische Chrestomathie. Atti del VI Colloquium retoromanistich Cormons 2014. Udine: Società Filologica Friulana, pp. 13-28. Neuefeind, C. (2013). The Digital Romansh Chrestomathy. Towards an Annotated Corpus of Romansh. In: Zampieri, M. and Diwersy, S. (Eds.), Special Volume on Non-Standard Data Sources in Corpus-Based Research (ZSM Studien 5). Aachen: Shaker pp. 41-58. Rolshoven, J. (2012). Die Digitale Rätoromanische Chrestomathie. Ladinia, 36: 119-51.