Entangled Histories of Early Modern Ordinances. Segmentation of Text and Machine-Learned Metadating.

Entangled Histories of Early Modern Ordinances. Segmentation of Text and Machine-Learned Metadating. Romein Christel Annemieke Ghent University; KB-Library; Erasmus University Rotterdam Annemieke.Romein@UGent.be Veldhoen Sara Floor KB-Library sara.veldhoen@kb.nl 2019-04-15T18:20:00Z Name, Institution

Street City Country Name

Converted from a Word document

DHConvalidator Paper Poster OCR; text-segmentation; machine-generated metadata; RDF; automatic content analysis corpus and text analysis history and historiography metadata law English OCR and hand-written recognition

Libraries and archives throughout Europe host books with ordinances, or individual ordinances (‘laws’) from the 15th to the 18th century (Härter, 1996). These texts contain indications of how governments of burgeoning states dealt with unexpected threats to safety, security, and order through home-invented measures, borrowed rules, or adjustments of what was established elsewhere. The hundreds of texts within these books are frequently consulted by researchers of various disciplines (e.g. history, law, political sciences, linguistics) to unravel rules for controlling complex societies. Having the possibility of a longitudinal search, based upon contents rather than the index or title, as well as having an overview based upon several states has so far been impossible due to the impenetrable amount of scanned texts. This project will disclose the entangled histories of neighbouring states, due to synchronic and diachronic comparisons – allowing a wider search and implementation in other projects Europe-wide.

This project-in-process will focus on the Dutch Early Modern Ordinances, but the techniques will have a widespread positive effect. This project consists of three steps:

1. Improving the OCR-techniques of digitised sources. We plan to use the Handwritten Text Recognition suite Transkribus to reprocess the files which currently have poor quality OCR. By treating the printed works – which were printed with hand-carved letters – as very consistent handwriting we hope to obtain a higher quality of recognition (CER <5%).

2. Segmentation of the texts, going from sentence-recognition (Lonij, Harbers 2016) to article-segmentation of larger texts; this requires that the computer is trained to recognise the beginning and end of texts, either as a chapter or as an individual text within a compilation of texts. Initially, this can be based upon image/layout but could evolve into integration with HTR, depending on its feasibility.

3. We want to create machine-generated metadata by training a computer to recognise the conditions that categories were based upon. This will be based on an already developed genre classifier (Lonij, Harbers, 2016) that can be fitted to automated content analysis. After supervised training, the computer can then suggest, apply and supplement categories to other texts based on the idea of topic modelling (Leydesdorff, 2017). This is a pilot that will prove the applicability of the tool to other languages as well.

Due to its significance for such a broadly studied range of sources, we hope to make the output available as RDF. We will use NLP such as NER technologies to identify dates, titles, persons; and implement the output in the Dutch national infrastructure. With the data generated in this project, visualisations of the development of laws across the Netherlands – and possibly Europe - would become possible.

Outcomes of this project:

Improved performance of the current OCR-recognition of (early modern) texts by incorporating HTR-techniques on printed texts. (A use-case of Transkribus.) The possibility to automatically recognise segments in text layout: beginning or end, columns, titles, dates, summaries and the body of the text. This data will be used for the RDF-compliant tool. Expansion of automatic content analysis based upon segments, rather than on lines or sentences, with a machine-learned algorithm, and applying this with standardised machine-learned metadata to early modern normative texts. This will allow researchers to search through approximately 15.000 ordinances and resolutions from various provinces in the Low Countries, accessible through uniform search terms. An RDF-compliant tool to enable application in other ongoing and future projects. Integration of the enhanced datasets in the CLARIAH/CLARIN ecosystem.

Bibliography Härter, K.; Stolleis, M.; Schilling, L. (eds.) (1996), Policey im Europa der Frühen Neuzeit. Frankfurt a/Main. Härter, K. and M. Stolleis (1996), Introduction to Repertorium der Policeyordnungen der frühen Neuzeit, Band 1. Deutsches Reich und geistliche Kurfürstentümer (Kurmainz, Kurköln, Kurtrier), edited by K. Harter. Frankfurt a/Main: pp. 1-36. Leydesdorff, L., and A. Nerghes. (2017) "Co ‐word maps and topic modeling: A comparison using small and medium ‐sized corpora (N< 1,000)." Journal of the Association for Information Science and Technology 68(4): 1024-35. Lonij, J., Harbers, F. (2016), Genre classifier. KB Lab: The Hague http://lab.kb.nl/tool/genre-classifier .