Mining Texts with the Extracted Features Dataset

The HathiTrust Digital Library holds digitized copies of nearly 15 million scanned volumes from libraries around the world. These volumes are a significant resource for large-scale research: with their scale and breadth of material, a digital humanities scholar can make new inferences on history, language use, cultural trends, or even the structure of the printed word. However, access is complicated by the complexities of navigating copyright laws around the world, while use of the materials is impeded by the effort and technical demands of a researcher. To address both of these issues, the Extracted Features (EF) dataset from the HathiTrust Research Center (HTRC) provides volumes in a format that has already been cleaned, extracted, and tagged for computation use.

In this hands-on tutorial, participants will learn to use the Extracted Features dataset for text analysis alongside the HTRC Feature Reader library, equipping them to perform research on millions of publicly-accessible volumes. Through the HTRC Feature Reader, participants will be make use of popular data science tools in Python for EF dataset analysis, and will be left with demonstrative materials to repurpose in their future work.

Data

The Extracted Features (EF) dataset from the HathiTrust Research Center (Capitanu et al., 2015) provides an open and permissive download of page-level extracted information for every page of 4.8 million volumes from the HathiTrust Digital Library. A “feature” refers broadly to a quantitative measure of some property in a dataset; for example, the number of times a word appear on a page. The EF data features include part-of-speech-tagged term counts, line and sentence counts, counts of the characters occurring on the far left and far right side of a page, and inferred language probabilities. Most notably, this information is provided for every page. Also, headers, body, and footer have been identified and features are provided separately for each part.

In the tutorial, participants will learn the significance of each feature, such as using line counts and character information to identify the type of content on a page, or using part-of-speech tags for improving topic models based on content.

Skills

This tutorial introduces participants to introductory text analysis in Python using the Extracted Features dataset with the HTRC Feature Reader. This includes accessing term counts and other raw information, slicing within that data, visualization trends within or across books, and leading into advanced techniques like topic modeling and sentiment tagging.

The skills taught in this tutorial are underpinned by programming in Python using a popular set of data science libraries. All code examples are provided, though they are most useful if participants are comfortable in tinkering and have a familiarity with Python's basic conventions. Our intention is to make the code examples transparent enough so as to be easily modifiable by beginner users.

A participant completing the workshop will understand:

the structure and possibilities of the HTRC Extracted Features Dataset; how to access the EF dataset files, both for individual and bulk use; how to start a Jupyter notebook, an accessible browser-based approach to data science in Python; the fundamentals of reading volume files, accessing metadata, and slicing and grouping token lists; basic visualization of EF data; and an advanced analytic technique modeled on recent digital humanities methods, discussed below.

The first part of the tutorial teaches the fundamental skills for working with the HTRC Feature Reader. For the final exercise of the tutorial, participants have a choice from prepared advanced exercises, which instructors will assist individually. This structure accommodates more intensive approaches in the time given, while also leaving participants with more examples for practicing their newly-acquired skills in the future.

The advanced exercises to be provided are: classification of paratext, using features suggested by Underwood (2014); visualization of sentiment trends in books as a proxy for a plot arc as previously performed by Jockers (2015), and within-book topic modeling using the inference approach presented by Organisciak et al. (2015).

Summary

The EF dataset offers a diverse and incredibly large collections of works for analysis in an easily accessible way. By providing these works as already extracted data, the EF dataset covers a large part of the text analysis workflow for researchers and is thus particularly suited for learners. This tutorial will use EF to teach text analysis through Python, using new software called the HTRC Feature Reader. By the end, students will be able to slice, group, and manipulate individual volumes for their needs, and will be familiar with techniques for modeling texts, identifying pertinent pages, and plotting trends across books.

Bibliography Capitanu, B., Underwood, T., Organisciak, P., Bhattacharyya, S., Auvil, L., Fallaw, C., J. and Downie J. C. (2015). Extracted Feature Dataset from 4.8 Million HathiTrust Digital Library Public Domain Vol. 2,[Dataset]. HathiTrust Research Center, doi:10.13012/j8td9v7m. Jockers, M. L. (2015). Revealing Sentiment and Plot Arcs with the Syuzhet Package. Matthew L. Jockers. Blog. http://www.matthewjockers.net/2015/02/02/syuzhet/. Organisciak, P., Auvil, L. and Downie J. S. (2015). Remembering books: A within-book topic mapping technique. Digital Humanities 2015. Sydney, Australia. Underwood T. (2014). Understanding Genre in a Collection of a Million Volumes. Interim Report. https://dx.doi.org/10.6084/m9.figshare.1281251.v1.