The HathiTrust Research Center (HTRC) aims to facilitate large-scale computational text analysis of the contents of the HathiTrust Digital Library (HTDL) through data services and analytical tools. We conducted a study of current and potential users of the HTRC to investigate how scholars integrate text analysis into their research. Our study aims to inform the development of HTRC services and also to generate deeper insights into scholarly research practices with large-scale digitized text corpora.
Studies on the use of digital content by humanities scholars, ranging from humanities cyberinfrastructure (ACLS, 2006) and patterns in scholarly practices (Brockman et al., 2001; Palmer and Neumann, 2002; Green and Courtney, 2015), to discipline-specific studies (Zorich, 2012; Babeu, 2011; Rutner and Schonfeld, 2011), reveal that scholars acquire and analyze digital content in multi-faceted ways. Several investigations particularly examine scholarly uses of digital tools (Frischer et al., 2006; Toms and O'Brien, 2008; Gibbs and Owens, 2012). Computational text analysis dates from the beginnings of humanities computing (Hindley, 2013), and the resources of the ARTFL Project (Argamon et al., 2009; Horton et al., 2009), MONK (Unsworth, 2011), Wordseer (Muralidharan and Hearst, 2013), Voyant and TAPoR (Rockwell et al., 2010), and Lexos (LeBlanc et al., 2013), among others, inform the current work of the HTRC to provide a secure computational and data environment for researchers to conduct analyses of content from the HathiTrust Digital Library.
Our study builds on an earlier user needs assessment conducted for the HTRC and its Mellon Foundation-funded Workset Creation for Scholarly Analysis project. That earlier study analyzed interviews and focus groups in order to identify capabilities needed in large text corpora to facilitate scholarly research use (Fenlon et al., 2014). These desired capabilities included the ability to create and manipulate collections as reusable datasets and research products, the ability to work at different units of analysis, and access to highly enriched metadata (Green et al., 2014; Fenlon et al., 2014).
Our present study extends that earlier investigation by examining the text analysis research practices of current and potential users of the HTRC.
Our study's primary goals are to inform the development of services that meet the needs of HTRC users and to generate deeper insights into scholarly research practices with large-scale digitized text corpora. While the findings will specifically guide the development of HTRC services, they also offer broader lessons for developing similar digital resources and research services for computational text analysis.
We conducted fifteen semi-structured interviews with students, faculty, researchers, administrators, and librarians who pursue work that includes text analysis or who have familiarity with text analysis methods. Some participants were recruited at professional conferences for digital humanities and libraries, while others were active in HTRC user group forums. Several of the interviewees had previously interacted with the HTRC, and most had experience with the HTDL. The participants came from various disciplines, including English, Anthropology, History, and Computer Science, and ranged from newcomers to digital humanities to long-time researchers.
We performed an initial analysis of the interview data through open coding and are continuing detailed qualitative analysis using ATLAS.ti. The data were coded independently by the authors to establish inter-coder reliability. While analysis is ongoing, we have identified several preliminary themes, discussed below: strategies for obtaining and managing data, research workflows and results, collaboration, and teaching.
Several respondents characterized text analysis research as being time-intensive in spite of the speed of computational tools. One interviewee noted, ‘It’s funny, often people think, “Oh we have it digitized, now it’s useful.” Scholars realize that you have a lot more work to do after that. And that can often slow projects down terribly.’
The interviewees indicated that gathering, managing, and manipulating text data comprised a considerable portion of their work. An interviewee explained, ‘I think the biggest challenge is data, getting good data to work with. I think people underestimate the problems and difficulties in doing that.’
Interviewees also expressed a desire for improved ways to identify and extract the content they needed, especially when navigating large-scale collections to find the volumes, pages, or passages relevant to a research project. As one interviewee remarked, ‘Even if you had somehow structured your texts, I would be saying, “What was left out? How do I bring it back in?”’
Several interviewees described the potential of text analysis to challenge previously held understandings of text, as differences between human and computational readings emerged. One respondent noted, ‘There are many cases in which the computer is at least as good—if not better—a reader than humans are. That’s very difficult for people to accept... sometimes the computer gets it right and it bears looking at that difference. So we kind of want to get that new ground truth on this kind of work.’
Many researchers highlighted the importance of interpretive work in understanding how the tools interact with the text, and characterized the interactions as dynamic. One respondent observed, ‘I yearn for workflows where the scholar could actually set their own tokenization rules.... It would be a way that we could create less language-specific [rules] or control the language specificity of the algorithm. I think that is the real need.’ Several respondents highlighted the importance of tools that flexibly fit into various stages of the research process, and also are accessible to users of different skill levels. Interviewees also suggested enhancements specific to the HTRC, which included expanded visualization capabilities, improved generation of statistics about text corpora, and better ability to handle languages other than English.
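The scholar-controlled tokenization this respondent calls for can be sketched in a few lines. The following Python function, its rule names, and its default pattern are purely illustrative assumptions, not an HTRC interface; the point is only that tokenization decisions (case folding, what counts as a word, minimum token length) can be exposed as parameters rather than fixed inside the algorithm:

```python
import re

def tokenize(text, rules=None):
    """Split text into tokens using caller-supplied rules.

    `rules` is a dict of hypothetical, scholar-settable options:
      pattern   - regex matching a single token
      lowercase - fold case before matching (default True)
      min_len   - drop tokens shorter than this (default 1)
    """
    rules = rules or {}
    # Default: runs of letters only, Unicode-aware, so the same
    # rule set is not hard-wired to English.
    pattern = rules.get("pattern", r"[^\W\d_]+")
    if rules.get("lowercase", True):
        text = text.lower()
    tokens = re.findall(pattern, text)
    return [t for t in tokens if len(t) >= rules.get("min_len", 1)]

# Default rules split on hyphens; a custom pattern keeps compounds whole.
print(tokenize("Self-fashioning in early texts"))
print(tokenize("Self-fashioning in early texts",
               {"pattern": r"[a-z]+(?:-[a-z]+)*", "min_len": 3}))
```

Exposing the regex and thresholds as data, rather than code, is one way a workflow could let scholars "control the language specificity of the algorithm" without requiring them to reimplement it.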
Interviewees repeatedly cited collaboration and research support, both virtual and in-person, as important. Many interviewees worked with digital humanities initiatives, and reported that their local resources ranged from limited technical support to well-resourced research centers. For some interviewees, online support communities, such as Digital Humanities Questions and Answers or Stack Overflow, were also significant.
Interdisciplinary collaborations between departments and across institutions emerged as the most prominent kind of partnership, but interviewees also noted the challenges that such collaborations pose. As one interviewee explained, ‘Collaborations between institutions: much more difficult. There’s money, there’s institutional blockages, and then anything over half a dozen people, it gets complicated very quickly. And so the people dynamics get very complicated.’ Some respondents noted that these collaborations affected their research practices and acquisition of research resources.
Interviewees reported that their collaborations with libraries ranged from non-existent to critical partnerships. Many saw the library as a key space because ‘the library is actually the one functioning interdisciplinary space on a university campus.’ Collaborations with the HTRC and with digital repositories for working with data were also important to respondents.
Interviewees mentioned their active efforts and intentions to incorporate computational text analysis into their teaching. Some remarked on institutional constraints that make it difficult to incorporate computational tools into curricula. As one respondent explained: ‘I once imagined teaching a class in which students learn to script and actually run analyses against data, but I was told, basically, that that class isn’t a humanities class anymore—that belongs in computer science.’
Some stated that the courses they currently teach may not require or allow for the incorporation of computational analysis. Yet others noted that there is only so much technical or scientific skill that a humanities student can realistically master within a short period of time, with one interviewee noting that ‘you can only get people to learn so much about the math; as much as they can learn, they should — at the same time, it’s hard.’
Although the demand from students for learning about computational text analysis was, overall, reported to be increasing, some interviewees noted that they are constrained not only by limited resources but also by uncertainty about how to carry out such activities. One interviewee reported prevailing sentiments that the digital humanities ‘doesn’t even fit anywhere,’ leading to the question of whether ‘there should be a whole separate department that’s digital humanities,’ or whether training should be offered within existing curricula.
The immediate aims of this study are to generate an updated framework of user requirements that will guide the development of the HTRC’s educational programming and research support services and also to inform forthcoming Mellon Foundation-funded development of the HTRC Data Capsule. But our preliminary findings also provide insights into scholars’ needs as they increasingly incorporate text analysis in research and teaching. These findings also reveal how digital scholarship centers, information professionals, and providers of digitized content can best support scholarship as digital humanities resources evolve.
We thank Megan Senseney, Angela Courtney, Nicholae Cline, and Leanne Mobley for their collaboration in this study.