Most modern and pre-modern Western writing systems explicitly mark word boundaries with spaces or breaks, so it has been easy to use computers to analyze texts word by word. However, several modern and pre-modern writing systems do not explicitly indicate word separation; all words in a sentence are contiguous. A major contemporary representative of this kind of writing is found in the languages of East Asia. Moreover, a popular Japanese pre-modern writing style called kuzushi-ji (cursive-style characters) was often presented with undivided characters, even in typesetting, until the late nineteenth century (Fig. 1, Fig. 2). The lack of word separation has evoked not only ambiguity but also multiple interpretations, and has thus formed an aspect of cultural richness in Japanese culture. At the same time, it has made Japanese texts intrinsically difficult to handle: not only in textual analysis but also in both manual and automatic transcription in the digital era. This presentation discusses the problems these writing systems pose and the current state of attempts to resolve them through the methods of digital humanities.
Recent Japanese texts pose no serious problem for OCR, thanks both to the separation of individual characters and to the accuracy and clarity of their printing. However, it is difficult to apply OCR to books printed even ten decades ago, for two reasons: most of them use character forms that are relatively complicated for OCR, and they embed parallel small-font glosses (called ruby in HTML5) that indicate the pronunciation of a word and sit too close to that word for OCR to separate (Fig. 3), even though they were printed with metal type. In still earlier books, characters were sometimes connected, and the writing style was partially cursive (Fig. 4). Recently, some researchers have been developing tools that recognize kuzushi-ji not from the shapes of individual characters but from the continuous shapes of character sequences. For both technical and intrinsic reasons, these tools cannot yet transcribe all characters accurately, but they can nonetheless assist in reading such texts by showing candidate characters (Fig. 5).
Hashimoto, Yuta, et al. The SMART-GS Project: An Approach to Image-based Digital Humanities. Digital Humanities 2014, 2014, pp. 476-477.
However, special difficulties arise when a needed character is not encoded in Unicode. The situation seems similar to that addressed by the Medieval Unicode Font Initiative, and to the proposals for encoding the Siddham script and Japanese hentaigana:
Medieval Unicode Font Initiative. http://folk.uib.no/hnooh/mufi/
Pandey, Anshuman. Proposal to Encode the Siddham Script in ISO/IEC 10646. ISO/IEC JTC1/SC2/WG2 N4294. 2012. http://www.unicode.org/L2/L2012/12234r-n4294-siddham.pdf
KAWABATA, Taichi, Toshiya SUZUKI, Kiyonori NAGASAKI, and Masahiro SHIMODA. Proposal to Encode Variants for Siddham Script. ISO/IEC JTC1/SC2/WG2 N4407. 2013. http://std.dkuug.dk/JTC1/SC2/WG2/docs/n4407.pdf
Anderson, Deborah, et al. 2013-11-22 Siddham Script (梵字) Meeting @ Tokyo, JAPAN, Earth. ISO/IEC JTC1/SC2/WG2 N4523. 2013. http://std.dkuug.dk/JTC1/SC2/WG2/docs/n4523.pdf
ITSCJ SC2 Committee, IPSJ, Japan. Proposal of Japanese HENTAIGANA. ISO/IEC JTC1/SC2/WG2 N4674. 2015. http://unicode.org/wg2/docs/n4674-Japan_Hentaigana_Proposal-a.zip
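As a rough, hypothetical illustration of this gap (not part of the proposals cited above), Python's standard unicodedata module can report whether a code point is assigned a name in the Unicode version bundled with the interpreter; a character still awaiting encoding simply has no code point to test:

```python
import unicodedata

def is_assigned(ch):
    """Return True if this single character maps to a named code point
    in the interpreter's Unicode database (a rough proxy for "already
    encoded"); unassigned code points have no name."""
    return unicodedata.name(ch, None) is not None

# A common kanji is assigned; U+0378 is currently an unassigned code
# point, standing in here for a not-yet-encoded character.
print(is_assigned("字"))      # True
print(is_assigned("\u0378"))  # False
```

Until a proposal such as those above is accepted and the database updated, texts containing such characters must fall back on images, private-use code points, or external glyph databases.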
Alongside these transcription efforts, digital image databases have also grown in Japan, thanks to the commoditization of high-resolution digitization of textual materials. In particular, the National Diet Library has been publishing a digitized collection of over 300,000 books for more than a decade, and it recently stated that most of them are to be released into the public domain.
http://dl.ndl.go.jp/
http://hyakugo.kyoto.jp/
http://dzkimgs.l.u-tokyo.ac.jp/utlib_kakouzou.php
http://www.nijl.ac.jp/
http://www.nijl.ac.jp/pages/cijproject/index_e.html
Crowdsourced transcription has also recently emerged in Japan. The Transcribe JP project has been conducted as a SIG of the Japanese Association for Digital Humanities, and it provides a Web service for collaborative transcription.
Hondigi 2014. http://lab.ndl.go.jp/dhii/omk2/
翻デジ@JADH×Crowd4U. http://www.jadh.org/transcribejp
Crowd4U. http://crowd4u.org/en/
In spite of the difficulties of transcription, there are many digitized texts in Japanese, such as Aozora Bunko, the corpora of NINJAL, and the SAT Daizōkyō Text Database.
http://www.aozora.gr.jp/
http://www.ninjal.ac.jp/
http://21dzk.l.u-tokyo.ac.jp/SAT/
The NINJAL texts consist of separated words annotated with POS tags, but most of the others are unsegmented. Consequently, two methods of textual analysis are common in Japan: one is n-gram analysis that treats each character as a unit; the other is automatic word segmentation, sometimes combined with POS tagging, using tools such as MeCab, ChaSen, or Kuromoji.
MeCab. http://taku910.github.io/mecab/
ChaSen. http://chasen.naist.jp/hiki/ChaSen/
Kuromoji. http://www.atilika.com/ja/products/kuromoji.html
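The character-based n-gram approach mentioned above can be sketched in a few lines of Python; the function below is an illustrative example, not code from any of the tools cited:

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Count overlapping character n-grams in an unsegmented text.

    Since Japanese sentences carry no spaces, each character is
    treated as a token and every window of n characters is counted.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Character bigrams of an unsegmented sentence ("I read a book"):
# 私は, は本, 本を, を読, 読む -- no dictionary or segmenter needed.
bigrams = char_ngrams("私は本を読む", n=2)
```

Segmentation-based analysis with a tool such as MeCab instead splits the sentence into words (私 / は / 本 / を / 読む) before counting, trading dictionary dependence for linguistically meaningful units.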
Related issues of encoding and markup also arise in XML-formatted texts, such as those maintained in TEI or JATS.
http://jats.nlm.nih.gov/
In the context of current DH, vast humanities resources still lie dormant. As they are awakened, issues of this kind will gradually be revealed and will need to be solved from both practical and theoretical viewpoints. By addressing them earnestly through global communication, DH will come to better fruition.