UpCASE (Upload, Correct, Annotate, Search and Export) is an open source web application
UpCASE is the result of years of work with Romansh texts and lexical resources. Romansh
European Charter for Regional and Minority Languages,
We present UpCASE together with a specific historical text collection, namely the
Romansh Chrestomathy (RC) compiled by Caspar Decurtins (Decurtins, 1888-1919). The RC comprises texts from four centuries reflecting the different idioms of Romansh
1986: 7). It contains approximately 7500 pages covering a wide range of topics, text types and genres, and is therefore an excellent basis for compiling a text corpus. All in all, the RC can be seen as a monument to the language, speakers, and culture of Romansh in Switzerland, and as such is exceptional among small linguistic and cultural communities (Rolshoven, 2012).
The RC text corpus was created in two successive projects funded by the German Research Foundation (DFG). In the first project, the RC was digitized and its characters were recognized by OCR. The main objective was to correct the OCR output in order to provide a digital full-text version of the RC. Because the RC is a multilingual historical text collection of a small language, with varying orthographic standards and almost no digital lexical resources available, the OCR errors could not be corrected fully automatically (Rolshoven, 2012). Instead, we implemented a web-based editing tool that allows native speakers to participate in the task of OCR correction.
2013). First, we compiled a lexical resource by digitizing lexica and generating inflected word forms. On this basis, we approached the linguistic annotation with a semi-automatic procedure that combines lexical lookup (which mostly yields ambiguous tags) with manual correction and supervision, thus adapting the collaborative methodology of the DRC project (Mondaca and Atanassov, 2016).
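The semi-automatic procedure described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the lexicon entries, the tag names and the `Token`, `lookup` and `confirm` helpers are all hypothetical and serve only to show how lexical lookup can produce ambiguous candidate tags that are then resolved manually.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Toy full-form lexicon (assumption): word form -> candidate POS tags.
# The entries and tag names are illustrative, not taken from the RC lexicon.
TOY_LEXICON = {
    "la": {"DET", "PRON"},
    "casa": {"NOUN"},
}


@dataclass
class Token:
    form: str
    candidates: Set[str] = field(default_factory=set)
    confirmed: Optional[str] = None  # set by a human annotator

    @property
    def tag(self) -> Optional[str]:
        """Return the confirmed tag, or the only candidate if unambiguous."""
        if self.confirmed is not None:
            return self.confirmed
        if len(self.candidates) == 1:
            return next(iter(self.candidates))
        return None

    @property
    def needs_review(self) -> bool:
        """Unknown or ambiguous tokens are queued for manual correction."""
        return self.tag is None


def lookup(forms, lexicon):
    """Attach all candidate tags found in the lexicon to each word form."""
    return [Token(f, set(lexicon.get(f.lower(), ()))) for f in forms]


def confirm(token, tag):
    """Record a manual decision for one token."""
    token.confirmed = tag


tokens = lookup(["La", "casa", "xyz"], TOY_LEXICON)
review_queue = [t.form for t in tokens if t.needs_review]
```

In this sketch, a token with exactly one candidate is accepted automatically, while ambiguous and unknown forms remain in the review queue until an annotator calls `confirm`.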
UpCASE brings together the experiences of both projects, combining the key features of collaborative corpus construction, enrichment and maintenance in a single web application. While existing tools mostly focus on a particular use case like collaborative correction (e.g. Wikisource
Using robust and scalable software on the server side and lightweight, interactive components on the client side, UpCASE offers several views to improve usability. Some views treat the collection as a whole, e.g. for searching, statistics and exporting; others modify the data at hand, e.g. for editing, annotating or enriching. After text documents (or scanned images of texts, which are then processed by OCR) have been imported, the text is indexed with Lucene and made accessible through an editable directory tree and a full-text search. The stats view offers basic statistical information about the text collection. In the export view, the user can choose among different formats, e.g. plain text or XML, to export the whole collection or parts of it.
At the document level, each token is represented by a clickable widget that opens a modal window containing different views, depending on user rights, each associated with specific functions such as editing, correction or annotation. The edit view allows the user to modify the text, e.g. to correct errors produced in the OCR process; it shows both the editable word form and the relevant part of the scanned image, with the token's position highlighted. The annotation view allows the user to create annotations such as POS tags on the fly, which in turn enables complex queries in the search view.
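To make the token-level export and annotation-based search more concrete, the following Python sketch shows one possible shape of these operations. It rests on assumptions throughout: the element names `doc` and `w`, the `pos` attribute and the three helper functions are hypothetical, and a simple in-memory filter stands in for the Lucene index that UpCASE actually uses.

```python
import xml.etree.ElementTree as ET

# A document as a list of (word form, POS tag or None) pairs (assumed model).
SAMPLE_DOC = [("la", "DET"), ("casa", "NOUN"), ("xyz", None)]


def export_plaintext(tokens):
    """Join the word forms with single spaces."""
    return " ".join(form for form, _ in tokens)


def export_xml(tokens):
    """Serialize tokens as <w> elements; the schema is a hypothetical one."""
    root = ET.Element("doc")
    for form, pos in tokens:
        w = ET.SubElement(root, "w")
        if pos is not None:
            w.set("pos", pos)  # only annotated tokens carry the attribute
        w.text = form
    return ET.tostring(root, encoding="unicode")


def search_by_pos(tokens, pos):
    """Return all word forms annotated with the given POS tag."""
    return [form for form, tag in tokens if tag == pos]


plain = export_plaintext(SAMPLE_DOC)
xml_text = export_xml(SAMPLE_DOC)
nouns = search_by_pos(SAMPLE_DOC, "NOUN")
```

Keeping the annotation as an optional attribute on each token lets the same data feed both export formats and the tag-based search, without requiring every token to be annotated first.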
Our presentation gives an overview of UpCASE and its basic functions, focusing on features for corpus maintenance, extension and enrichment. Although we primarily present a Romansh language resource, the concepts and features of this use case can be transferred to other text collections and languages. UpCASE can thus be seen as an approach that supports, both technologically and methodologically, the preservation of the cultural heritage of regional and minority languages.