One of the key problems in historical political research is that many relevant research questions can only be answered by means of a prolonged and painstaking analysis of large archival series. Questions like: “How did the relationship between the government, the parliament and the political elites change over time?”, “What role did political traditions and rituals play?”, and “In what ways did the information economy influence the political process?”, still require scholars to systematically work through vast bodies of archival material. Only a few scholars, who appeared not to be intimidated by such a daunting task, have come up with long-term analyses of political patterns. This paper proposes a new, digital approach to avoid these time-consuming activities and to open up major archival series for automated text analysis, by applying various tools for text recognition and automated structuring, as well as by using reference data and re-using metadata.
The case study selected to demonstrate the potential of this approach is the opening up of the Resolutions of the Dutch States-General from 1576 to 1795. This archival series of the central assembly of the seven Provinces forming the Dutch Republic is an excellent example of the type of source suitable to answer the above-mentioned research questions. Editing the Resolutions has been a task of the Huygens Institute for the History of the Netherlands and its predecessors since 1915 (Japikse and Rijperman, 1915-1970; Van Deursen et al., 1971-1994; see http://resources.huygens.knaw.nl/besluitenstatengeneraal1576-1630/index_html_en, accessed 3 March 2016).
In 2015 we adopted a fundamentally different approach as an alternative to editing this vast collection of documents. Our goal is to make this important collection accessible, searchable, and analyzable for historical research by applying various advanced digital humanities tools. We will do this by using the metadata of the already printed edited Resolutions, and by enriching and linking the data to other relevant research data. The advantage of this approach is that the Resolutions will become accessible to researchers far more quickly than through classic editing. The approach should also yield insights that are useful for comparable future projects dealing with important archival series, and it may provide an alternative to full-scale editing of large historical sources in the digital era.
From our experience with digital editing, we know that small-scale experiments are the best way to establish best practices and avoid huge costs. We have therefore chosen a multilevel approach with several pilot projects, using various tools that may be applicable for exploring the content of some 200,000 pages of resolutions for historical research. These projects start from the principle of using what is already there. In practice, this means that we will first construct a framework from the metadata added by the previous editors, such as indexes, markup of names, places and institutions, and subject classifications, together with the contemporary indexes and marginal subject notes in the resolutions. This framework will serve as a reference to make the resolutions corpus accessible.
Secondly, we experiment with tools for Handwritten Text Recognition (HTR). The software developed by researchers from the Universitat Politècnica de València, which is integrated in the Transkribus platform (https://transkribus.eu/Transkribus/, accessed 3 March 2016), offers the most valuable results (Sanchez et al., 2013). As the developers from Valencia have demonstrated, manual correction of incorrect transcriptions produced by the HTR software leads to a recalculation that reduces the mistakes the software makes in the rest of the text. This will speed up the work towards an accurate transcription of the whole series. Nevertheless, given the limited number of people able to read 17th-century Dutch handwriting, we cannot expect crowdsourcing to provide us with a perfect transcription of the whole series of resolutions within a few years.
Thirdly, we investigated the use of tools for enriching the resolutions with annotations that will make the digitized material easier to explore. For this purpose we used contemporary printed resolutions, of which a series exists from 1703 to 1796. We selected a set of 100 pages from the year 1725, containing 366 resolutions. The text of the resolutions was transcribed and marked up manually using TEI. The printed material consists of blocks of text that can be categorized as follows: “session”, consisting of the date of the meeting, the name of the chairman, and the attendees representing each of the seven provinces; “resumption”, the summary of the previous meeting; “resolution”, the body of the resolutions themselves; and “insertion”, mostly incoming letters. We used a standard machine learning approach. The text blocks were manually marked up with their type. Part of the material was used for training the automated categorizer, the other part for testing. As features, the categorizer was trained on the fixed expressions that the successive clerks of the States-General used in their accounts of the meetings. Comparison of the results with the manual categorization turned out to be promising: we were able to categorize the different types of text blocks with a precision of 98.6%.
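To make the reported figure concrete, the following is a minimal sketch of how precision can be computed per block type by comparing the manual markup with the output of a categorizer. It assumes the usual definition (correct assignments to a category divided by all assignments to that category); the text does not state whether the reported figure is averaged over categories or over blocks. The function name and the toy labels are invented for illustration and are not the project's data.

```python
from collections import Counter


def precision_per_category(gold, predicted):
    """Return {category: precision}, where precision is the share of
    blocks assigned to a category that also carry it in the manual markup."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    predicted_counts = Counter(predicted)
    correct_counts = Counter(p for g, p in zip(gold, predicted) if g == p)
    return {cat: correct_counts[cat] / n for cat, n in predicted_counts.items()}


# Invented toy labels: manual markup versus categorizer output.
gold = ["session", "resumption", "resolution", "resolution", "insertion", "resolution"]
predicted = ["session", "resumption", "resolution", "insertion", "insertion", "resolution"]

scores = precision_per_category(gold, predicted)
for category, value in sorted(scores.items()):
    print(f"{category}: {value:.1%}")
```

In this toy example one “resolution” block is wrongly labelled “insertion”, so only the precision for “insertion” drops, while the other categories are unaffected.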
The next step was to extract information from the text blocks. Building on software of the Stanford Natural Language Processing Group, we developed a rule-based tool for recognizing dates with a precision of 99.1%. The dates in the “session” text blocks could be identified easily, because they have a fixed structure; this allows us to annotate each resolution with its date. Software for identifying more complex dates in the text blocks (for example “the resolution taken yesterday”) is yet to be written; it will be used to annotate resolutions with references to other resolutions.
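The idea of rule-based date recognition for fixed-structure headings can be sketched as follows. This is not the tool described above, which builds on Stanford software; it is a plain regular-expression stand-in in Python. The heading format ("den", day, month name, year), the month spellings, and the function name are assumptions made for illustration; a real rule set would be derived from the corpus itself.

```python
import re
from datetime import date

# Assumed spelling variants of month names (hypothetical, not taken from the corpus).
MONTHS = {
    "january": 1, "januari": 1, "february": 2, "februari": 2,
    "maart": 3, "april": 4, "mey": 5, "mei": 5, "juny": 6, "juni": 6,
    "july": 7, "juli": 7, "augusti": 8, "augustus": 8,
    "september": 9, "october": 10, "oktober": 10,
    "november": 11, "december": 12,
}

# Rule: "den <day> <month name> <year>", with the year between 1500 and 1799.
DATE_RE = re.compile(
    r"\bden\s+(?P<day>\d{1,2})\.?\s+(?P<month>[A-Za-z]+)\s+(?P<year>1[5-7]\d{2})\b"
)


def parse_session_date(heading):
    """Return a datetime.date for a session heading, or None if no rule applies."""
    match = DATE_RE.search(heading)
    if match is None:
        return None
    month = MONTHS.get(match.group("month").lower())
    if month is None:
        return None
    try:
        return date(int(match.group("year")), month, int(match.group("day")))
    except ValueError:
        # Impossible calendar dates such as "31 Juny" are rejected.
        return None


print(parse_session_date("Lunae den 1 January 1725"))
print(parse_session_date("the resolution taken yesterday"))
```

Relative expressions such as “the resolution taken yesterday” fall outside these rules and return `None`, which corresponds to the more complex cases that the text says are still to be handled.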
Furthermore, we developed a tool using a Naïve Bayes Classifier for categorizing the attendance list. The manuscript and printed resolutions list the attendees according to the province they represented; the provinces were mentioned in a fixed order. However, at some meetings a province was not represented. With the tool we are able, in these cases as well, to identify the provinces that the attendees and the chairman represented.
Finally, we took some steps in interpreting the content of the resolutions. Again using fixed phrases and a Naïve Bayes Classifier, it turns out to be possible to recognize in most cases whether or not a decision was taken in a resolution (96.9% precision) and whether the States-General asked a person or a body for advice (99.3%). The results for determining whom the States-General asked for advice, and whether and from whom they requested a report, are as yet inconclusive because of the limited amount of training material.
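A Naïve Bayes classifier over fixed phrases can be sketched in a few lines. The version below is a minimal Bernoulli variant with Laplace smoothing, written from scratch; the text does not specify which variant or which software was used, so this is an assumption. The phrase list, the class labels, and the training sentences are all invented for illustration and do not reproduce the project's training material.

```python
import math

# Hypothetical fixed phrases used as binary features (not the project's list).
PHRASES = ["is goedgevonden en verstaan", "om te adviseren"]


def features(text):
    """Binary vector: 1 if the phrase occurs in the text, otherwise 0."""
    lowered = text.lower()
    return [int(phrase in lowered) for phrase in PHRASES]


class BernoulliNB:
    """Minimal Bernoulli Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.log_prior = {}
        self.p_feature = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            # Smoothed probability that each phrase occurs in class c.
            self.p_feature[c] = [
                (sum(row[j] for row in rows) + 1) / (len(rows) + 2)
                for j in range(len(X[0]))
            ]
        return self

    def predict(self, x):
        def log_score(c):
            score = self.log_prior[c]
            for value, p in zip(x, self.p_feature[c]):
                score += math.log(p if value else 1 - p)
            return score
        return max(self.classes, key=log_score)


# Invented toy training set: does the resolution record a decision?
train = [
    ("Waar op gedelibereert zynde, is goedgevonden en verstaan dat ...", "decision"),
    ("Is goedgevonden en verstaan dat ...", "decision"),
    ("Ontfangen een missive van ...", "no_decision"),
    ("Is gelesen de requeste van ...", "no_decision"),
]
model = BernoulliNB().fit([features(t) for t, _ in train], [label for _, label in train])

print(model.predict(features("Daar op is goedgevonden en verstaan dat ...")))
print(model.predict(features("Ontfangen een missive van ...")))
```

With so few phrases and examples the model is little more than a smoothed keyword test. The value of the probabilistic formulation shows only when many partly overlapping phrases are combined, and, as the inconclusive results mentioned above suggest, when enough training material is available.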
Apart from improving the results of this last-mentioned analysis, our future work will be dedicated to several issues. Firstly, we will apply Named Entity Recognition to the resolutions in combination with the use of reference data. The Huygens ING owns and hosts several relevant data collections, for instance the Biography Portal of the Netherlands (http://www.biografischportaal.nl/en/, accessed 3 March 2016) and the Compendium of Office-Holders and Civil Servants 1428-1861 (http://resources.huygens.knaw.nl/repertoriumambtsdragersambtenaren1428-1861/index_html_en, accessed 3 March 2016).