Tikkoun Sofrim is part of a network of projects (Sofer Mahir, eRabbinica, Scribes of the Cairo Genizah, Scripta-PSL) aimed at developing a future framework for comprehensive and systematic textual availability, editions and deep annotation of (Hebrew) manuscripts via a pipeline that combines HTR with crowdsourcing the corrections and validation by scholars. End uses are scholarly editions (with TEI-Publisher), library service (integration via mirador), automatic annotation, distant-reading and big-data studies.
The project focuses on the Tanhuma-Yelamdenu Midrashim, late rabbinic exegetical works that lack a full scholarly critical edition, even though vulgate editions in two recensions [1,2] and translations [3] exist. Unlike other classical works, they were not subject to formal canonization but went through a long period of fluid transmission [4]. Manuscripts differ significantly from each other on the micro- and macro-levels. The texts reflect different possible hierarchies following the biblical text, liturgical rites or divisions in early print editions. Hence, the envisaged data structure will need to go beyond a simple canonical structure to support flexible access to the texts from both library and critical-edition perspectives.
The current layout analysis is based on the z-profile-method for column detection [5] and the heartbeat-seam-carve algorithm for line detection [6], both well adapted to literary manuscripts with regular layout. Textual ground truth was created by manual transcription from scratch or by adapting transcriptions sent by scholars who had worked on parts of the manuscripts. Using the open-source kraken engine [7], we trained models for four large manuscripts (Geneva Com. Lat. 146, BNF Cod. Héb 150, Bibl. Vaticana ebr. 44 vol. 1 and 2, Parma, Palatina, Cod. Parm. 3122) [8]. The character error rates (CERs) range from 2.8% (BNF) and 2.9% (Parma) to 6.9% (Vatican) and 8.9% (Geneva). The likely reasons for the weaker results for Vatican and Geneva are the use of low-resolution images (Vatican) as well as binarization issues, curved lines and superposed words (Geneva). These issues will be addressed in an upcoming, much improved layout analysis based on state-of-the-art CNNs, part of kraken@eScriptorium at PSL [9,10,11].
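The CER figures above can be read as the character-level edit distance between the automatic transcription and the ground truth, divided by the length of the ground truth. The following is a minimal sketch under that common definition, not the evaluation code used in the project; the function names `levenshtein` and `cer` are illustrative, and details such as Unicode normalization or the treatment of whitespace are left out.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character of ref missing in hyp
                curr[j - 1] + 1,     # extra character in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of `hyp` against the ground truth `ref`
    (assumed definition: edit distance / reference length)."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Toy example (not project data): one wrong letter in a six-letter word.
example = cer("בראשית", "בראשות")
```

Under this definition, a CER of 2.8% corresponds to roughly 28 character edits per 1000 ground-truth characters.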
Our crowdsourcing platform opened on Feb. 13th, 2019 [12]. To maximise our reach, we offer a designated UI for mobile and desktop. The technological stack used for development was selected to simplify future code contribution and collaboration with other platforms: Vue.js / Google-Firebase for the operational site and MySQL for analytical purposes.
Since classical rabbinic literature is still thriving in contemporary ritual and religious study cycles, we based community recruitment and engagement on links to websites for daily study that follow annual and other study cycles. The launch was timed to coincide with the beginning of Deuteronomy in 929, a community project for the daily study of a Hebrew Bible chapter, and we assigned fitting pages to each biblical chapter.
While the link with 929 proved very fruitful for a first broad exposure to the public, ongoing traffic came mainly from intensive Facebook activity in relevant groups and pages.
Attracted at first by hilarious mistakes in the automatic transcription, users went on to long-term engagement. Other crowd-engagement methods involved presentations in schools, synagogues, universities and homes for senior citizens in Israel, France and the US. Two articles in the press, one of which included the URL of our website, greatly increased the number of users in week 6 [13,14].
Until April 26, we received on average 5005 lines per week.
Mega-users (>1000 lines) and super-users (>100 lines), representing 4.6% and 19.3% of all users respectively, provided almost 80% of all transcriptions.
Based on the submitted transcriptions, we produced a word-level alignment with collatex [15,16]. With a majority vote, we established a preliminary complete text for the 590 pages of Geneva 146. According to a manual check, the automatic reconstruction contains about two mistakes per page (ca. 1700 glyphs/page), i.e. an accuracy of ca. 99.8%; most of these mistakes concern markup for additions or deletions.
A transcription based on collating contributions of 5 users/line.
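The majority vote over aligned contributions can be sketched as follows. This is an illustrative sketch only, not the project's code: it assumes that the word-level alignment has already been produced (for instance by a collation tool) and is given as one row of tokens per contributor, with `None` marking a gap. The function name `majority_vote` and the toy tokens are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence

Token = Optional[str]  # None stands for a gap in the alignment


def majority_vote(aligned_rows: Sequence[Sequence[Token]]) -> List[str]:
    """Pick the most frequent reading in every alignment column.

    Each row is one contributor's transcription of the same line,
    already aligned so that all rows have the same length.
    Columns where a gap wins the vote are dropped from the result.
    """
    if not aligned_rows:
        return []
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")

    consensus: List[str] = []
    for column in zip(*aligned_rows):
        # On a tie, Counter.most_common keeps the reading that was
        # seen first, i.e. the one from the earliest contributor.
        winner, _count = Counter(column).most_common(1)[0]
        if winner is not None:
            consensus.append(winner)
    return consensus


# Toy example with five contributors for one line (not project data).
rows = [
    ["w1", "w2", "w3", None],
    ["w1", "wX", "w3", None],
    ["w1", "w2", "w3", "w4"],
    ["w1", "w2", None, None],
    ["w1", "w2", "w3", None],
]
line_text = majority_vote(rows)
```

In this toy case the isolated variant `wX` and the single extra word `w4` are outvoted, which illustrates why several independent contributions per line can cancel out individual slips.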
We now have the tools and know-how for high-quality automatic transcription of Hebrew manuscripts. Crowdsourcing is a promising mechanism for correcting the remaining errors. Future versions of the platform should include embedded community engagement and feedback elements to ensure continuity and to optimize social media engagement activity.
The project [17] provides a proof of concept (PoC) for future large-scale production of manuscript texts, based on combined deep learning and crowdsourcing mechanisms. These will be based on optimizing the performance of HTR engines and crowd task-assignment, using crowd-based transfer learning and weak supervision.