Converted from a Word document
Libraries and archives throughout Europe host books with ordinances, or individual ordinances (‘laws’) from the 15th to the 18th century (Härter, 1996). These texts contain indications of how governments of burgeoning states dealt with unexpected threats to safety, security, and order through home-invented measures, borrowed rules, or adjustments of what was established elsewhere. The hundreds of texts within these books are frequently consulted by researchers of various disciplines (e.g. history, law, political sciences, linguistics) to unravel rules for controlling complex societies. Having the possibility of a longitudinal search, based upon contents rather than the index or title, as well as having an overview based upon several states has so far been impossible due to the impenetrable amount of scanned texts. This project will disclose the entangled histories of neighbouring states, due to synchronic and diachronic comparisons – allowing a wider search and implementation in other projects Europe-wide.
This project-in-process will focus on the Dutch Early Modern Ordinances, but the techniques will have a widespread positive effect. This project consists of three steps:
1.
Improving the OCR-techniques of digitised sources. We plan to use the Handwritten Text Recognition suite Transkribus to reprocess the files which currently have poor quality OCR. By treating the printed works – which were printed with hand-carved letters – as very consistent handwriting we hope to obtain a higher quality of recognition (CER <5%).
2. Segmentation of the texts,
going from sentence-recognition (Lonij, Harbers 2016)
to article-segmentation of larger texts; this requires that the computer is trained to recognise the beginning and end of texts, either as a chapter or as an individual text within a compilation of texts. Initially, this can be based upon image/layout but could evolve into integration with HTR, depending on its feasibility.
3. We want to
create machine-generated metadata by training a computer to recognise the conditions that categories were based upon. This will be based on an already developed genre classifier (Lonij, Harbers, 2016) that can be fitted to automated content analysis. After
supervised training, the computer can then suggest, apply and supplement categories to other texts based on the idea of topic modelling (Leydesdorff, 2017). This is a pilot that will prove the applicability of the tool to other languages as well.
Due to its significance for such a broadly studied range of sources, we hope to make the output available as RDF. We will use NLP such as NER technologies to identify dates, titles, persons; and implement the output in the Dutch national infrastructure. With the data generated in this project, visualisations of the development of laws across the Netherlands – and possibly Europe - would become possible.
Outcomes of this project: