Research context
Handwritten Text Recognition (HTR) is “the ability of a computer to transform handwritten input represented in its spatial form of graphical marks into an equivalent symbolic representation as ASCII text.” (Romero et al., 2012: 5)
What is the state of the art of the application of HTR to early modern manuscripts? With what level of accuracy can HTR models automate their transcription? What is known about how HTR currently accommodates manuscript text that shows changing writing styles, hands and text in multiple languages? We will explore these questions with reference to the wider literature and a case study of the first HTR model to be created for the hand of
Optical Character Recognition (OCR) on documents with perfectly machine-printed characters can reach an accuracy level of more than 99% (Cao and Natarajan, 2014: 336–37). However, OCR is often problematic for historical documents (Smith and Cordell, 2019: 5). It also cannot be used for handwritten documents, since the space between characters and words is inconsistent.
Holistic segmentation-free off-line HTR technology works at a line level and can deal with cursive characters, slanted words and irregular calligraphy, but it must be trained for a specific handwriting (Alabau and Leiva, 2012: 2274; Sánchez et al., 2014: 111–12). HTR is not accurate enough to replace human expertise; however, it holds the potential to bolster the transcription process (Toselli et al., 2018: 174, 176).
‘Enlightenment Architectures: Sir Hans Sloane’s Catalogues of his Collections’ is a Leverhulme-funded collaboration between the British Museum and UCL.
It studies 5 of the manuscript catalogues of
The catalogue of Miscellanies, two of his Natural History catalogues (Fossils vol. I and vol. V) and two of his library catalogues (Sloane MS 3972C vol. VI and Sloane MS 3972B).
The HTR model discussed here was trained using the software Transkribus. The aim of the e-Infrastructure project READ (Recognition and Enrichment of Archival Documents) is to make archival sources more accessible through technological development. The centrepiece of READ is the service platform and application Transkribus, which enables the automatic recognition and transcription of handwritten documents and the ability to search within them (READ project, 2018a; READ project, 2018b).
Methodology
To train an HTR model with Transkribus, one has to provide it with training data (digital surrogates of the original folios and their transcriptions). This is known as ground truth or reference data.
The segmentation of the document into its elements, in particular the baselines, and the actual transcription are both crucial for creating an adequate HTR model. The ground truth data must consist of a representative sample of a collection’s documents and must also respect the original appearance of the script, e.g. special characters, as closely as possible. In Transkribus, the ground truth serves both to train the HTR model and to evaluate its accuracy. Between 75 and 100 pages (around 15,000 to 20,000 words) of training data are necessary for an effective HTR model.
A randomized selection of documents is recommended (READ project, 2018c: 3–4; READ project, 2017: 10).
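As a minimal sketch of what a randomized selection of ground-truth pages could look like, the following Python function draws disjoint training and test sets from a list of page identifiers. It is an illustration only and does not use the Transkribus API; the function name, the page identifiers, the set sizes and the fixed seed are all assumptions made for the example.

```python
import random


def split_pages(page_ids, n_train, n_test, seed=0):
    """Randomly draw disjoint training and test sets of pages.

    Hypothetical helper for illustration; not part of Transkribus.
    """
    if n_train + n_test > len(page_ids):
        raise ValueError("not enough pages for the requested split")
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    shuffled = list(page_ids)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    train = sorted(shuffled[:n_train])
    test = sorted(shuffled[n_train:n_train + n_test])
    return train, test


# Invented page identifiers, for illustration only.
pages = [f"page_{i:03d}" for i in range(1, 121)]
train, test = split_pages(pages, n_train=100, n_test=20)
print(len(train), len(test))
```

Keeping the test pages out of the training set matters because the same ground truth is used both to train the model and to evaluate it.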
We determined that the first sub-section of the Miscellanies catalogue (folios 2-152v) would give enough training and test data to evaluate the model because it contains important characteristics of the whole collection of catalogues, such as annotations and a complex layout.
We wish to thank the members of the Enlightenment Architectures team for their assistance in making this selection and for their wider advice about this case study.
For this research, five different HTR models were created, including one pre-test model, so that their accuracy could be compared as the amount of training data changed. Training started with 75 folios and was then increased to 100 and 125 folios. For the last model, a base model was added to the 125 folios of training data.
Results
The quality of an HTR transcription can be evaluated according to its Word Error Rate (WER) and Character Error Rate (CER) (Romero et al., 2012: 93). Transkribus reports both measures (READ project, 2018c: 5).
WER is […] “the minimum number of words that need to be substituted, deleted or inserted to convert a sentence recognized by the system into the corresponding reference transcription, divided by the total number of words in the reference transcription […]” (Romero et al., 2012: 55). CER is the minimum number of single characters that need to be corrected, divided by the total number of characters in the reference text (Romero et al., 2012: 55). Transkribus also offers a learning curve visualisation for evaluating the general accuracy of a model, and a compute accuracy function for measuring its accuracy at page level (READ project, 2018d: 9–12). According to READ (2018d: 10), a model with an accuracy rate of 90% can be regarded as producing an effective automated transcription.
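The two error rates defined above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used by Transkribus: it computes a plain Levenshtein edit distance, splits words on whitespace, and counts every character (including spaces) for CER. The function names and the example strings are invented for the sketch.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of substitutions, deletions and insertions
    needed to turn hyp into ref (Levenshtein distance)."""
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # ref item missing from hyp
                            curr[j - 1] + 1,      # extra item in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits / words in the reference."""
    ref_words = reference.split()  # assumption: whitespace tokenisation
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits / characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Invented example: one dropped letter in a four-word line.
reference = "the quick brown fox"
hypothesis = "the quick brwn fox"
print(wer(reference, hypothesis))  # 1 wrong word out of 4
print(cer(reference, hypothesis))  # 1 missing character out of 19
```

In this invented example a single missing letter costs one word in four for WER but only one character in nineteen for CER, which is why the two measures can differ widely for the same transcription.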
The evaluation showed that our current model, trained on 20,803 words, reached a CER of 12.73% without the base model. The transcription has not reached a level of accuracy that is sufficient for academic research without further human input. The model has problems correctly transcribing names (of persons and places), abbreviations, double letters (e.g. ee), punctuation, Latin text and the numbers in the margins.
Conclusion
In the paper we will reflect on how our methodology and model might be refined in order to improve the CER, in line with the experiences of other projects (for example Hodel, 2017 or Prell, 2018). We will give particular attention to questions like ‘Where in particular does recognition fail?’, ‘How much training data is necessary to create a model with an accuracy of at least 90%?’ and ‘How might external resources like gazetteers and name authority lists be integrated into Transkribus and used in conjunction with the HTR model in order to increase the accuracy of the transcription of named entities?’
Our responses to questions like these are likely to be transferable to other projects that seek to build HTR models for the transcription of early modern manuscript materials.
Although our model reached a relatively high level of accuracy, it is not good enough to be used for scholarly work. We will therefore also reflect on scenarios where the model could still be used, such as Authorship Attribution (Franzini et al.,