From Handwritten Text to Structured Data: Alternatives to Editing Large Archival Series

Ronald Sluijter, Huygens Institute for the History of the Netherlands, ronald.sluijter@huygens.knaw.nl
Marielle Scherer, Huygens Institute for the History of the Netherlands, marielle.scherer@huygens.knaw.nl
Sebastiaan Derks, Huygens Institute for the History of the Netherlands, sebastiaan.derks@huygens.knaw.nl
Ida Nijenhuis, Huygens Institute for the History of the Netherlands, ida.nijenhuis@huygens.knaw.nl
Walter Ravenek, Huygens Institute for the History of the Netherlands, walter.ravenek@huygens.knaw.nl
Rik Hoekstra, Huygens Institute for the History of the Netherlands, rik.hoekstra@huygens.knaw.nl

2016-03-04T08:23:00Z

Maciej Eder, Pedagogical University in Krakow; Jan Rybicki, Jagiellonian University
Institute of Polish Studies Pedagogical University ul. Podchorazych 2 30-084 Krakow, Poland maciej.eder@ijp-pan.krakow.pl


Short Paper
Keywords: handwritten historical documents, metadata, text analysis, information retrieval
Topics: historical studies, information retrieval, metadata, scholarly editing, text analysis, content analysis, data mining / text mining, English, digital humanities - diversity

Introduction

One of the key problems in historical political research is that many relevant research questions can only be answered by means of a prolonged and painstaking analysis of large archival series. Questions such as “How did the relationship between the government, the parliament and the political elites change over time?”, “What role did political traditions and rituals play?”, and “In what ways did the information economy influence the political process?” still require scholars to work systematically through vast bodies of archival material. Only a few scholars have not been intimidated by this daunting task and have produced long-term analyses of political patterns. This paper proposes a new, digital approach that avoids this time-consuming work and opens up major archival series for automated text analysis by applying various tools for text recognition and automated structuring, by using reference data, and by reusing existing metadata.

The case study selected to demonstrate the potential of this approach is the opening up of the Resolutions of the Dutch States-General from 1576 to 1795. This archival series of the central assembly of the seven provinces forming the Dutch Republic is an excellent example of the type of source suited to answering the above-mentioned research questions. Editing the Resolutions has been a task of the Huygens Institute for the History of the Netherlands and its predecessors since 1915 (Japikse and Rijperman, 1915-1970; Van Deursen et al., 1971-1994).

http://resources.huygens.knaw.nl/besluitenstatengeneraal1576-1630/index_html_en (accessed 3 March 2016)

This task is hindered, though, by the sheer size of the series, which consists of approximately 200,000 pages. The classic editing process, which did not even aim at a full transcription of the resolutions but only at abstracts, had reached the year 1625 by 1994. After that, a project was carried out to edit the resolutions from 1626 to 1630 in digital form only, with the help of XML encoding (Nijenhuis, 2006; Nijenhuis et al., 2007). This project ended in 2007 and was not pursued further, because it had become clear that this method, too, consumed too much time and money.

In 2015 we started on a fundamentally different approach as an alternative to editing this vast collection of documents. Our goal is to make this important collection accessible, searchable, and analyzable for historical research by applying various advanced digital humanities tools. We will do this by using the metadata of the Resolutions already edited in print, and by enriching the data and linking it to other relevant research data. The advantage of this approach is that the Resolutions will become accessible to researchers far more quickly than through classic editing. The approach is also intended to yield insights that will be useful for comparable future projects dealing with important archival series, and it may provide an alternative to full-scale editing of large historical sources in the digital era.

Reusing metadata

From our experience with digital editing, we know that small-scale experiments are the best way of establishing best practices and avoiding huge costs. We have therefore chosen a multilevel approach with several pilot projects, using various tools that may be applicable for exploring the content of some 200,000 pages of resolutions for historical research. These projects start from the principle of using what is already there. In practice, this means that we will construct a framework consisting of the metadata added by the previous editors (indexes; mark-up of names, places and institutions; and classification of subjects) as well as the contemporary indexes and marginal subject notes in the resolutions. This framework will serve as a reference for making the resolutions corpus accessible.

Automated Handwritten Text Recognition

Secondly, we are experimenting with tools for Handwritten Text Recognition (HTR). The software developed by researchers from the Universitat Politècnica de València, which is integrated in the Transkribus platform, offers the most valuable results (Sánchez et al., 2013).

https://transkribus.eu/Transkribus/ (accessed 3 March 2016)

For the HTR experiment we manually transcribed 40 pages of handwritten resolutions. Of these pages, 30 were used for training and 10 for testing. The resulting Word Error Rate was 40%. We realize that an only partly correct transcription does not deliver the results one can expect from a traditional edition. The automated transcriptions generated by the HTR tool should rather be seen as an alternative that enables researchers to search the text, although with a Word Error Rate of 40% the search results will not be perfect. The HTR may be improved by expanding the training set and by using reference data, which is discussed below. If our approach is funded for the whole series of resolutions, we intend to use crowdsourcing via the Transkribus platform to improve the HTR results on the handwritten resolutions. As the developers from Valencia have demonstrated, manually correcting erroneous transcriptions leads to a recalculation that reduces the mistakes the software makes in the rest of the text. This will speed up the work towards an accurate transcription of the whole series. Nevertheless, given the limited number of people able to read 17th-century Dutch handwriting, we cannot expect crowdsourcing to provide us with a perfect transcription of the whole series of resolutions within a few years.
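The Word Error Rate reported above is not defined in the text. A minimal sketch, assuming the usual definition (word-level substitutions, deletions and insertions divided by the number of words in the manual reference transcription), could look as follows. The function name and the toy strings are illustrative only and are not taken from the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Toy example: one substituted and one missing word out of five reference words.
print(word_error_rate("a b c d e", "a x c e"))
```

In this toy case two of the five reference words are affected, so the rate equals 0.4, the same order of magnitude as the 40% measured on the 10 test pages.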

Automated annotation

Thirdly, we investigated the use of tools for enriching the resolutions with annotations that will make the digitized material easier to explore. For this purpose we used contemporary printed resolutions, of which a series exists from 1703 to 1796. We selected a set of 100 pages from the year 1725, containing 366 resolutions. The text of the resolutions was transcribed and marked up manually using TEI. The printed material consists of blocks of text that can be categorized as follows: “session”, consisting of the date of the meeting, the name of the chairman, and the attendees representing each of the seven provinces; “resumption”, the summary of the previous meeting; “resolution”, the body of the resolutions themselves; and “insertion”, mostly incoming letters. We used a standard machine learning approach. The text blocks were manually marked up with their type. Part of the material was used for training the automated categorizer, the other part for testing. The features used to train the categorizer were the fixed expressions that the successive clerks of the States-General used in their accounts of the meetings. Comparison with the manual categorization showed promising results: we were able to categorize the different types of text blocks with 98.6% precision.
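The text does not name the learning algorithm or list the fixed expressions, so the following is only a sketch of the general idea: fixed expressions serve as features, a simple vote count is learned from manually labelled blocks, and precision is computed per block type. The cue phrases, the example blocks and the voting scheme are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical cue phrases; the expressions actually used are not given in the text.
CUES = [
    "praesentibus",
    "geresumeert",
    "is ter vergaderinge gelesen",
    "hoogh mogende heeren",
]


def extract_features(block):
    """Return the set of cue phrases that occur in a text block."""
    text = block.lower()
    return {cue for cue in CUES if cue in text}


def train(examples):
    """Count how often each cue co-occurs with each block type."""
    counts = defaultdict(Counter)
    for text, label in examples:
        for cue in extract_features(text):
            counts[cue][label] += 1
    return counts


def predict(counts, block, default="resolution"):
    """Let every cue found in the block vote for the types it was seen with."""
    votes = Counter()
    for cue in extract_features(block):
        votes.update(counts.get(cue, {}))
    return votes.most_common(1)[0][0] if votes else default


def precision(predicted, gold, label):
    """Share of blocks predicted as `label` that carry that label in the gold data."""
    gold_for_predicted = [g for p, g in zip(predicted, gold) if p == label]
    if not gold_for_predicted:
        return 0.0
    return gold_for_predicted.count(label) / len(gold_for_predicted)


# Invented training blocks, one per type.
TRAINING = [
    ("Praeside de Heer A. Praesentibus de Heeren B en C.", "session"),
    ("De Resolutien gisteren genomen zyn gelesen en geresumeert.", "resumption"),
    ("Is ter Vergaderinge gelesen de Requeste van D.", "resolution"),
    ("Hoogh Mogende Heeren, wy hebben de eer te berichten.", "insertion"),
]
model = train(TRAINING)
print(predict(model, "Praeside de Heer E. Praesentibus de Heeren F en G."))
```

Precision here follows the standard definition (correct predictions of a type divided by all predictions of that type); whether the reported 98.6% was averaged over the four types is not stated in the text.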

The next step was to extract information from the text blocks. Building on software of the Stanford Natural Language Processing Group, we developed a rule-based tool that recognizes dates with 99.1% precision. The dates in the “session” text blocks could be identified easily, because they have a fixed structure; this allows us to annotate each resolution with its date. Software for identifying more complex date expressions in the text blocks (for example “the resolution taken yesterday”) is yet to be written; it will be used to annotate resolutions with references to other resolutions.
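To illustrate what a rule for fixed-structure session dates might look like, here is a small sketch using a regular expression. It is not the Stanford-based tool described above. The header format ("den", day, month name, year) and the month spellings are assumptions, and a real rule set would need to cover the spelling variants found in the printed series.

```python
import re
from datetime import date

# Assumed month spellings; the variants in the 1725 volume may differ.
MONTHS = {
    "january": 1, "february": 2, "maart": 3, "april": 4,
    "mey": 5, "juny": 6, "july": 7, "augusti": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

# Assumed pattern: "den <day>[.] <month name> <year>", e.g. "den 8 January 1725".
SESSION_DATE = re.compile(
    r"\bden\s+(\d{1,2})\.?\s+([A-Za-z]+)\s+(\d{4})\b", re.IGNORECASE
)


def parse_session_date(header):
    """Return the first valid date found in a session header, or None."""
    for match in SESSION_DATE.finditer(header):
        day, month_name, year = match.groups()
        month = MONTHS.get(month_name.lower())
        if month is None:
            continue  # unknown month spelling, try the next candidate
        try:
            return date(int(year), month, int(day))
        except ValueError:
            continue  # impossible date, e.g. caused by a recognition error
    return None


print(parse_session_date("Lunae den 8 January 1725."))
```

Relative expressions such as “the resolution taken yesterday” cannot be resolved by a pattern of this kind alone; they would additionally need the date of the session in which they occur.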

Furthermore, we developed a tool using a Naïve Bayes Classifier for categorizing the attendance list. The manuscript and printed resolutions list the attendees according to the province they represented, and the provinces were mentioned in a fixed order. However, at some meetings a province was not represented. With the tool we are able to identify the provinces that the attendees and the chairman represented in these cases as well.

Finally, we took some steps towards interpreting the content of the resolutions. Again using fixed phrases and a Naïve Bayes Classifier, it turns out to be possible to recognize in most cases whether or not a decision was taken in a resolution (96.9% precision) and whether the States-General asked a person or a body for advice (99.3%). The results for determining whom the States-General asked for advice, and whether and from whom they requested a report, are as yet inconclusive because of the limited amount of training material.
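As an illustration of the technique named above, the following is a minimal multinomial Naïve Bayes classifier with add-one smoothing, written from scratch. It simplifies the approach in one respect: it uses single words as features, whereas the text speaks of fixed phrases. The class name, the labels and the four training sentences are invented for the example and are not project data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (add-one by default)
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        # Tokens never seen during training carry no information and are skipped.
        tokens = [t for t in text.lower().split() if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_tokens = sum(self.token_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                count = self.token_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_tokens + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training sentences with hypothetical labels.
texts = [
    "is goedgevonden ende verstaen dat de requeste sal werden toegestaen",
    "waar op gedelibereert zynde is goedgevonden ende verstaen",
    "is ter vergaderinge gelesen de missive van den resident",
    "ontfangen een missive van den consul",
]
labels = ["decision", "decision", "no_decision", "no_decision"]

classifier = NaiveBayes().fit(texts, labels)
print(classifier.predict("is goedgevonden ende verstaen dat"))
```

With so little training material the output says nothing about real accuracy; the sketch only shows how recurring formulae can shift the class probabilities.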

Apart from improving the results of this last analysis, our future work will be dedicated to several issues. Firstly, we will apply Named Entity Recognition to the resolutions in combination with reference data. The Huygens ING owns and hosts several relevant data collections, for instance the Biography Portal of the Netherlands (http://www.biografischportaal.nl/en/, accessed 3 March 2016) and the Compendium of Office-Holders and Civil Servants 1428-1861 (http://resources.huygens.knaw.nl/repertoriumambtsdragersambtenaren1428-1861/index_html_en, accessed 3 March 2016), with which we will be able to identify persons and institutions mentioned in the resolutions. Secondly, we aim to improve the OCR for the printed resolutions, for the benefit of the automated annotation. Thirdly, we will investigate whether the tools for automated structuring can also be applied to the automated transcriptions of the handwritten resolutions.

Bibliography

Deursen, A. T. van, Smit, J. G. and Roelevink, J. (Eds.) (1971-1994). Resolutiën der Staten-Generaal: Nieuwe reeks, 1610-1670. Vol. 7, Gravenhage.

Nijenhuis, I. (2006). Besluiten ontsloten. Resolutiën Staten-Generaal digitaal (1626-1630). De Zeventiende Eeuw, 22: 272-82.

Nijenhuis, I. J. A., De Cauwer, P. L. R., Gijsbers, W. M., Hell, M., Meij, C. O. van der and Schooneveld-Oosterling, J. E. (Eds.) (2007, update 2011). Resolutiën Staten-Generaal 1626-1630. Den Haag.

Japikse, N. and Rijperman, H. H. P. (1915-1970). Resolutiën der Staten-Generaal van 1576 tot 1609. Vol. 14. Den Haag.

Sánchez, J. A., Mühlberger, G., Gatos, B., Schofield, P., Depuydt, K., Davis, R. M., Vidal, E. and Does, J. de (2013). tranScriptorium: a European Project on Handwritten Text Recognition. Proceedings of the 2013 ACM symposium on Document engineering. New York: ACM Press, pp. 227-28.

Thomassen, T. (2015). Instrumenten van de macht. De Staten-Generaal en hun archieven 1576-1796. Den Haag: Huygens ING.

Wittek, P. and Ravenek, W. (2011). Supporting the Exploration of a Corpus of 17th-Century Scholarly Correspondences by Topic Modeling. In Maegaard, B. (Ed.), Supporting Digital Humanities 2011: Answering the unaskable. Copenhagen.