Converted from a Word document
The written heritage of the “Islamicate”
Introduced by the University of Chicago historian Marshall Hodgson, the term “Islamicate” refers to anything Islamic and non-Islamic, religious and non-religious that was produced in the vast geographical region we refer to as the “Islamic world” (see, Hodgson, 1974).
Currently, OpenITI includes about 4,300 unique book titles (predominantly in Arabic) by over 1,800 authors, which amounts to approximately 750 million words (1,46 billion words with all versions). Most texts have been collected from open-access online collections of high-quality digital reproductions of premodern texts.
Such as
http://shamela.ws/,
http://shiaonlinelibrary.com/,
https://ganjoor.net/, etc..
Figure 1. Chronological distribution of authors and books in OpenITI.
OpenITI is organized in compliance with Canonical Text Services (CTS) guidelines as implemented in the
CapiTainS Suite (Clérice et al., 2017), with two exceptions made for practical purposes. First, we implemented human-readable URNs (
uniform resource name) as this allows for easier subsetting of the corpus and makes it easier to engage non-DH specialists in Islamicate studies in collaboration. Second, we are postponing conversion to TEI XML, since it poses a number of challenges for right-to-left languages, connected scripts, and extensive texts (many texts are over 1 million words).
Figure 2. CTS URN Structure: al-Ḏahabī’s
Taʾrīḫ al-islām.
Figure 3.
CapiTainS
-Compliant Folder Structure: Versions of al-Ġazālī’s
Iḥyāʾ ʿulūm al-dīn incorporated into the repository
0525AH within OpenITI.
CTS offers a powerful mechanism for building expandable and interoperable corpora. The power of CTS lies in the URN, which provides the permanent canonical references to texts and are used by CTS to identify or retrieve passages of text (
Figure 2). The tree-like structure of CTS URNs together with modular folder organization (
Figure 3) allow one to easily expand the corpus in a decentralized manner as well as to accommodate as many texts and their versions in any number of languages as might be necessary.
Texts have been automatically converted into our custom format—
OpenITI mARkdown (
AR stands for Arabic). Facilitating conversion of raw texts into machine-actionable formats, this flavor of markdown: 1) simplifies work with multivolume texts that make up the core of the corpus; and 2) helps to avoid problems that one faces when paired symbols (e.g., angle brackets), LTR and RTL languages, and connected scripts occur in the same document, when even a simple editing task becomes overly complicated. Additionally,
mARkdown will facilitate conversion into TEI XML, the de facto standard for publishing digital editions. The detailed description of
OpenITI mARkdown can be found at
https://maximromanov.github.io/mARkdown/.
Figure 3. OpenITI mARkdown: A sample text (
biography) tagged morphologically and semantically (using EditPad Pro,
https://www.editpadpro.com/).
Available on GitHub ( https://github.com/OpenITI), OpenITI includes four different types of repositories:
In order to achieve comprehensiveness, we currently are working on the improvement of Arabic-script OCR (see, Kiessling et al., 2017) and the development of a digital text production pipeline (collaboratively with SHARIAsource, Harvard Law School). Called CorpusBuilder, this user-friendly, web-based, open-source application will significantly lower the technological barriers to entry and help us involve colleagues, librarians and citizen scientists from around the world in our collaborative project of corpus creation.