Building worksets for scholarship by linking complementary corpora

Kevin Page, kevin.page@oerc.ox.ac.uk, University of Oxford, United Kingdom
Terhi Nurmikko-Fuller, terhi.nurmikko-fuller@anu.edu.au, Australian National University, Australia
Timothy Cole, t-cole3@illinois.edu, University of Illinois, United States of America
J. Stephen Downie, jdownie@illinois.edu, University of Illinois, United States of America

Background and General Motivation

The HathiTrust Digital Library

The HathiTrust Digital Library (HTDL) comprises digitized representations of 15.1 million volumes: approximately 7.47 million book titles, 418,216 serial titles, and 5.3 billion pages, across 460 languages. HTDL is best described as “a partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future”. The HathiTrust Research Center (HTRC) develops software models, tools, and infrastructure to help digital humanities (DH) scholars conduct new computational analyses of works in the HTDL. For many scholars the size of the HTDL corpus is both attractive and daunting: many existing DH tools are designed for smaller collections, and many research inquiries are better served by more focused, homogeneous collections of texts (Gibbs and Owens, 2012).

Worksets

In many, if not most, DH research endeavours, performing an analytical task across the whole HTDL is neither practical nor productive (Kambatla et al., 2014). For example, a tool trained to identify genre attributes of 18th-century English-language prose fiction may not be applicable to 20th-century French poetry. The first step is therefore to identify a subset (of works, editions, volumes, chapters, or pages) that sets an initial investigative scope, and then to refine that subset iteratively as research proceeds. In a corpus as large and complex as the HTDL, finding materials and then defining the sought-after subset can be extraordinarily difficult. The HTRC calls a collection of digital items brought together by a scholar for analysis a “workset”; worksets are created to help researchers build, manipulate, iteratively refine, and compare their collections. Reflecting upon input and advice from the DH community, Jett (2015) defines a workset as a machine-actionable research collection realised as:

1. An aggregation of members (volumes, pages, etc.);
2. Metadata intrinsic to the workset's essential nature (e.g. creator, selection criteria);
3. Metadata intrinsic to digital architectures (i.e. creation date and number of members);
4. Metadata supportive of human interactions (i.e. title and description);
5. Derivative metadata from workset members (e.g. formats, languages); and
6. Metadata concerning workset provenance (e.g. derived from, used by).

Broadly, item 1 identifies the actual data used in an analysis, whereas the remaining metadata items describe the workset itself, aiding workset management throughout the research cycle; a sketch of such a record follows.
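The following is a minimal sketch of such a machine-actionable record, expressed as RDF using Python's rdflib. The namespaces, property names, and volume identifiers are illustrative assumptions only; the HTRC workset ontology defines its own classes and properties, which we do not reproduce here.

```python
from datetime import date

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, XSD

# Hypothetical namespaces; the actual HTRC workset ontology differs.
WS = Namespace("http://example.org/workset#")
HT = Namespace("http://example.org/htdl/volume/")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
workset = URIRef("http://example.org/worksets/baxter-example")

# 1. The aggregation of members (here, two invented volume identifiers).
members = [HT["mdp.001"], HT["uc1.002"]]
for m in members:
    g.add((workset, WS.hasMember, m))

# 2. Metadata intrinsic to the workset's nature: creator and selection criteria.
g.add((workset, DCTERMS.creator, Literal("A. Scholar")))
g.add((workset, WS.selectionCriteria, Literal("works written by Richard Baxter")))

# 3. Metadata intrinsic to digital architectures: creation date and size.
g.add((workset, DCTERMS.created, Literal(date.today(), datatype=XSD.date)))
g.add((workset, WS.memberCount, Literal(len(members))))

# 4. Metadata supporting human interaction: title and description.
g.add((workset, DCTERMS.title, Literal("Works by Richard Baxter")))
g.add((workset, DCTERMS.description, Literal("Baxter's works across HTDL and EEBO-TCP.")))

# 5. Metadata derived from the members themselves (e.g. language).
g.add((workset, DCTERMS.language, Literal("en")))

# 6. Provenance: for instance, the workset this one was derived from.
g.add((workset, PROV.wasDerivedFrom, URIRef("http://example.org/worksets/parent")))

print(g.serialize(format="turtle"))
```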
Cross-corpus worksets

As alluded to above, numerous criteria can be used to select the constituents of a workset, and several technological implementations could, in theory, realise worksets. In researching the design and realisation of worksets and associated tooling, we are also mindful to remain grounded in their practical application and the needs of scholarly users. We have therefore undertaken our work through discipline-based scenarios in which we can explore the strengths and weaknesses of the HTDL viewed through the prism of worksets.

We report one such exploration here, asking whether (relatively) small, well-explored, and well-understood corpora can be superimposed over the HTDL to aid navigation and investigation of the much larger, more superficially understood collection. From a system perspective, a cross-corpus workset requires exposing compatible metadata (items 2-6 above) from multiple collections, used first to align common elements and then to assemble worksets. We take a Linked Data approach and achieve compatibility through ontologies, which might initially be bibliographic (and derived from library records) but should be iteratively extensible into the domain of the subject of study.

Examples in early English print

The Early English Books Online Text Creation Partnership (EEBO-TCP) is a partnership with ProQuest and over 150 libraries and universities, led by the Universities of Michigan and Oxford, to generate highly accurate, fully-searchable texts tracing the history of English thought and learning from the first book printed in English in 1473 through to 1700. Between 2000 and 2009, EEBO-TCP Phase I converted 25,000 selected texts from the EEBO corpus into TEI-compliant, XML-encoded, hand-transcribed texts, which were subsequently freely released in January 2015. In the work reported here, we have conjoined EEBO-TCP with a HathiTrust subset consisting of all materials described in their metadata as being in English and published between 1470 and 1700.

To ensure a prototype which simultaneously explored the fit of scholars' needs to the technology and exercised the technical challenges outlined in the previous section, we undertook a ‘complete circuit' through the datasets (Figure 1). We: (i) ran a consultative workshop to choose investigations which might form the basis of worksets; (ii) used these abstract worksets to identify concrete requirements for the conjoined metadata; (iii) generated metadata from both corpora according to these specifications; (iv) aligned elements from both datasets in an overlapping superset; and (v) realised the worksets identified in (i) using this metadata.

Figure 1. Overview of the metadata circuit leading to our cross-corpus workset

Motivating worksets

Following the workshop we identified the following workset selections, whose implementation we describe in subsequent sections:

• Find all the works, appearing in both datasets, written by Richard Baxter.
• Find works in both datasets published in Oxford.
• Find works published outside of London (where the bulk were published).
• Find works from both datasets published outside of London in the mid-to-late 1600s.
• Find all works in the two datasets for authors who have at least once published on the subject of “Political science”.
• Find all works in these two datasets for authors who have at least once published works which are categorised as “biography”.

Regarding the penultimate workset, it is of particular note that it returns results across both datasets even though our EEBO-TCP import contained no genre or topic information: the association must be inferred entirely from the semantic links created by the technology described below.

Implementation

Metadata requirements and ontology selection

Building on Nurmikko-Fuller et al. (2015) and Jett et al. (2016), we surveyed the addressable resources and the schema expressivity of ontologies that could parameterise these classes of workset. We identified parsable information structures in the EEBO-TCP TEI data, appropriate to the test worksets, and selected ontology terms to encode this EEBO-TCP metadata, ensuring compatibility (or at least, for our purposes, comparability) with RDF from the HathiTrust. The resultant ontology collection, the EEBO Ontology (EEBOO), includes selections from MODS, BIBFRAME, and PROV, along with custom elements encoding additional structures (e.g. dates).

Creating EEBOO RDF and alignment with HTDL

Python scripts manipulated the TEI P5 XML, after which the Karma Data Integration Tool mapped EEBO-TCP data structures into the EEBOO ontology. Particular attention was paid to dates encoded within strings, an example of rich semi-structured data that can be extracted into structured RDF (a sketch of such extraction follows below). Links to author records in VIAF and the Library of Congress (LoC), and to multimedia pages in the HTDL and the ‘JISC Digital Books' website, were generated and added. Finally, author names were aligned between the EEBOO and HTDL triples using a reconfiguration of the SALT tool (Weigl et al., 2016).

In total, 24,926 EEBO-TCP Phase I records were processed. Their headers contain 22 distinct types of information, including 6 different ID types and 3 types of date (the publication date of the historical work, historical dates associated with the author, and XML publication/editing dates). EEBOO incorporates 7 of these datatypes and extends into subcategories for author names and date types. EEBOO contains 713 unique places and 6,489 unique expressions of Person, of which 3,588 have VIAF and LoC IDs.
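As an indication of the kind of date handling involved, the following is a minimal, hypothetical sketch in Python. The production pipeline's actual scripts and patterns are not reproduced here, and real imprint strings are considerably more varied; early modern imprints often give the year in Roman numerals, which the sketch also handles.

```python
import re

# Invented imprint strings of the kind found in EEBO-TCP TEI headers.
IMPRINTS = [
    "London : Printed by R. White for Nevil Simmons, 1657.",
    "Oxford, printed in the yeare MDCXLV.",
]

ROMAN = {"M": 1000, "D": 500, "C": 100, "L": 50, "X": 10, "V": 5, "I": 1}

def roman_to_int(s):
    """Convert a Roman numeral such as MDCXLV to an integer."""
    total = 0
    for i, ch in enumerate(s):
        val = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one.
        if i + 1 < len(s) and ROMAN[s[i + 1]] > val:
            total -= val
        else:
            total += val
    return total

def extract_year(imprint):
    """Pull a publication year out of a free-text imprint string."""
    m = re.search(r"\b(1[4-6]\d\d)\b", imprint)  # Arabic years 1400-1699
    if m:
        return int(m.group(1))
    m = re.search(r"\b([MDCLXVI]{3,})\b", imprint)  # Roman-numeral years
    return roman_to_int(m.group(1)) if m else None

for s in IMPRINTS:
    print(extract_year(s))  # -> 1657, then 1645
```

A year recovered this way can then be attached to the work as a typed literal in the EEBOO RDF, rather than remaining buried in the imprint string.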
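Name alignment can be illustrated in the same spirit. SALT itself is not reproduced here; this sketch shows only the flavour of normalised fuzzy matching such an alignment involves, with an invented similarity threshold.

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalise(name):
    """Lower-case, strip accents, digits and punctuation, and reorder
    'Surname, Forename' into 'Forename Surname'."""
    if "," in name:
        surname, _, forename = name.partition(",")
        name = f"{forename} {surname}"
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z ]", "", name.lower()).strip()

def aligned(eeboo_name, htdl_name, threshold=0.9):
    """Treat two author labels as the same person above a similarity threshold."""
    a, b = normalise(eeboo_name), normalise(htdl_name)
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(aligned("Baxter, Richard", "Richard Baxter"))             # True
print(aligned("Baxter, Richard, 1615-1691", "Richard Baxter"))  # True (dates stripped)
```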
Figure 2. Architecture providing cross-corpus worksets for early English print

Workset construction and viewing

A Virtuoso triplestore (see also the Virtuoso GitHub repository) stores the RDF data (totalling 1,137,502 triples) and provides a SPARQL query interface. Figure 2 shows the overall system architecture. The workset constructor user interface (Figure 3) allows the user to select parameters in a web form which are, in the background, assembled into the SPARQL queries used to create a workset (a sketch of this query assembly follows Figure 3). The interface automatically populates valid attribute values, themselves retrieved from the triplestore using ontological terms that carry equivalent meaning across the datasets. In combination, the generated triples and SPARQL queries are fully sufficient to express the motivating workset definitions described earlier. The workset viewer (also Figure 3) then retrieves the RDF workset contents, record metadata, data links, and multimedia links (to the Historic Books collection or the HTDL). Both web applications are written in Python, using the Flask framework, and both rely on the semantic information encoded in RDF and queried using SPARQL.

Figure 3. Prototype workset constructor and viewer (example workset shown: all works published in Oxford)
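To make the query assembly concrete, the following is a minimal sketch using SPARQLWrapper against a local Virtuoso endpoint. The endpoint URL, prefixes, and property IRIs are illustrative assumptions rather than the deployed EEBOO vocabulary; the query shape corresponds to the “Political science” workset, in which subject information attached to an author's works in one dataset selects that author's works in both.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint and vocabulary; the deployed EEBOO/HTDL graph
# uses its own terms, which are not reproduced here.
ENDPOINT = "http://localhost:8890/sparql"

def workset_query(subject_heading: str) -> str:
    """Assemble the SPARQL for: all works, from either corpus, by authors
    who have at least once published on the given subject."""
    return f"""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX ex:      <http://example.org/eeboo#>

    SELECT DISTINCT ?work ?title WHERE {{
        # Authors with at least one work on the chosen subject
        # (subject headings are present only on the HTDL side) ...
        ?author ex:authorOf ?subjectWork .
        ?subjectWork ex:subjectHeading "{subject_heading}" .
        # ... and then every work by those authors, from both corpora.
        ?author ex:authorOf ?work .
        ?work dcterms:title ?title .
    }}
    """

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(workset_query("Political science"))
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["work"]["value"], "-", row["title"]["value"])
```

In the prototype, queries of this shape are generated in the background by the Flask application from the user's form selections.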
Conclusion and future work

We have demonstrated the general feasibility of cross-corpus worksets in bringing together HathiTrust content with specialised collections, through a specific implementation for early English printed books linking the HathiTrust to EEBO-TCP. Using Linked Data, we see that metadata can be extended in a piecemeal or iterative fashion, potentially moving beyond traditional bibliographic metadata to include semantic structures emerging from scholarly investigation of the worksets themselves, and in doing so can support academic motivations and requirements for workset creation.

Acknowledgements

We are grateful to our colleague Pip Willcox for her valuable input and her organisation of the scholars' workshop, and to Jacob Jett for his workset ontology. This work was supported by the Andrew W. Mellon Foundation through the Workset Creation for Scholarly Analysis project award.