Unanticipated Afterlives: Resurrecting Dead Projects and Research Data for Pedagogical Use

Megan Finn Senseney, University of Illinois, United States of America, mfsense2@illinois.edu
Paige Morgan, University of Miami, United States of America, p.morgan@miami.edu
Miriam Posner, University of California, Los Angeles, United States of America, mposner@humnet.ucla.edu
Andrea Thomer, University of Michigan, United States of America, athomer@umich.edu
Helene Williams, University of Washington, United States of America, helenew@uw.edu


Panel / Multiple Paper Session
Keywords: data sharing, research data, legacy projects, digital pedagogy, DH curriculum
Topics: archives, repositories, sustainability and preservation; teaching, pedagogy and curriculum; library & information science
Language: English
Overview

Pedagogical exercises in the digital humanities rely on student access to humanities data. While strategies range from instructor-prepared datasets (Sinclair and Rockwell, 2012) to having students digitize texts directly from print materials (Croxall, 2017), data repositories and web-based DH projects are two of the most attractive sources for identifying, appraising, and accessing data for classroom use.

Yet data for teaching is rarely cited as a prime motivation or rationale for sharing research data. In “The Conundrum of Sharing Research Data,” Christine Borgman examines four rationales for sharing research data: (1) to reproduce or to verify research, (2) to make results of publicly funded research available to the public, (3) to enable others to ask new questions of extant data, and (4) to advance the state of research and innovation (2012). Pedagogy may be included implicitly in the third rationale, but by foregrounding pedagogical intentions, we can more readily operationalize a process for how we enable others to ask new questions of our data, which, in turn, will inform our motivations for sharing as well as the manner in which we do so.

Web-based DH projects are often conceived and developed for public consumption with short-term support through grant funding. While initiatives such as these have proliferated since the 1990s, they often languish as legacy projects on institutional servers without clear plans for sustainability or sunsetting (Rockwell et al., 2014). Rather than being construed as an institutional burden, long-dormant projects may continue to function as object lessons and raw materials for use in the DH classroom. Evaluating early digital projects based on their fitness for use as pedagogical datasets distinguishes the project from its component parts and allows aspects of the project to live on in new contexts.

This panel will include representatives from five public research universities across the United States. We will begin with a brief overview, followed by four case studies. Each panelist will speak for fifteen to twenty minutes, leaving time for questions from—and conversations with—the audience. Cases are drawn from the DH 101 course at UCLA, the DH Librarianship course at the University of Washington, the University of Miami Libraries’ Legacy Site Adoption Project, and the Humanities Data workshop at DHOxSS. Our goal is to explore the intersection of data sharing and digital pedagogy to interrogate how past projects (whether formally archived or otherwise) are adopted as data sets for teaching and training; propose evaluation criteria for selecting these data sets; discuss what these classroom efforts indicate about the sustainability of DH projects (and their data); and examine how our knowledge of these classroom cases might inform curatorial decisions in active DH projects.

Learning from our mistakes: Using old projects to create better library/faculty collaborations

The Legacy Sites Adoption Project (LSAP) developed in response to what the library administration saw as a significant problem: the library website hosted nearly 40 digital projects built 5-20 years earlier by a former library faculty member, now languishing in various states of brokenness but still featured prominently on the website. Retiring and removing the sites would erase the memory of the library’s institutional history, but repairing them would create an impossible burden for the web and application development team and would reinforce the idea of the library playing a service-and-support role in DH, rather than acting as an active partner.

The solution that we are currently implementing is to experiment with making the legacy sites “adoptable”: the content and metadata of each site are made available as a zip file containing CSVs of data and metadata and the accompanying image, audio, and video files, along with a readme pointing both to the current site on the library servers and to an archived (and often more functional) version of the site in the Internet Archive’s Wayback Machine. Faculty and students can use the zip files as base material for creating their own version(s) of the original sites, either carrying on the original concept as stated or taking it in a new direction. The original versions of the sites present opportunities for classes to think about developing DH projects with a direct focus on revision: potentially reading and critiquing the original sites through the lenses of recent scholarly essays, or considering the choices made by the original creators in light of how DH practices and tools have changed since the sites were built.
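A minimal sketch of how such an adoption package might be assembled is shown below. The directory layout, file names, and URLs are hypothetical, and the actual LSAP packaging workflow may differ; the sketch only illustrates the bundle of flat files plus readme described above.

```python
# Sketch only: hypothetical layout and URLs, not the actual LSAP packaging workflow.
# Assumes a legacy site has already been exported to flat files:
#   legacy_sites/example_site/content.csv, metadata.csv, media/...
import zipfile
from pathlib import Path

README = """Adoptable legacy site: Example Site
Current (partially broken) version: https://library.example.edu/projects/example-site
Archived version: https://web.archive.org/web/*/library.example.edu/projects/example-site
Contents: content.csv, metadata.csv, media/ (image, audio, and video files)
"""

def build_adoption_package(site_dir: Path, out_path: Path) -> None:
    """Bundle the flat-file export and a readme into a single zip for adopters."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("README.txt", README)
        for csv_file in site_dir.glob("*.csv"):  # content and metadata tables
            zf.write(csv_file, csv_file.name)
        for media_file in (site_dir / "media").rglob("*"):
            if media_file.is_file():
                zf.write(media_file, str(media_file.relative_to(site_dir)))

build_adoption_package(Path("legacy_sites/example_site"), Path("example_site_adoptable.zip"))
```

Keeping the readme alongside the flat files is what makes the zip legible to adopters who never saw the original site running, and the pointer to the Wayback Machine copy preserves a view of the site as it once functioned.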

LSAP engages with ongoing questions about what makes a good entry point into digital humanities work. Instead of building entry points around particular tools (Omeka, Voyant, etc.) or around a particular research question or collection of material that is not yet a project, adopting legacy sites foregrounds the iterative nature and inevitable fragility of project webpages, while making explicit the relationship between the websites and the flat files of their content.

With LSAP, we are also attempting a positive intervention into collaborative relationships between departmental faculty and librarians. Frequently, faculty come to librarians asking for support for a particular idea for a digital project, or for help incorporating digital methodologies into a classroom setting. In such instances, the faculty member may have little experience with, or knowledge of, key factors such as scoping and scaling project milestones, the availability of digitized objects, copyright and permissions restrictions, and the affordances of out-of-the-box tools. Our hope is that by offering projects that are ripe for revision, and by focusing on areas that are frequently taught and studied at the university, we can provide an entry point for collaboration that is more appropriately bounded, resulting in less uncertainty and less labor-intensive experiences for faculty, students, and librarians.

Awakening sleeping data for the DH classroom

As anyone who teaches digital humanities knows, humanities-related datasets are as hard to find as they are desirable. Since the closure of the Arts and Humanities Data Service in 2008, no centralized repository for humanities data has emerged. The DH instructor is therefore left to scour the web for data to share with students so that they can practice data cleaning, manipulation, and visualization. Sometimes this data comes from libraries, archives, and museums, but it comes just as often from scholars’ long-hibernating research projects. Indeed, scholars are often surprised to learn that their data has taken on a new life as the basis for student projects.

The last several decades have seen explosive growth in flexible, accessible tools for working with data. These new platforms offer possibilities for visualization and analysis that, just 10 or 15 years ago, would have required custom programming. With this palette of tools, even relatively inexperienced students can breathe new life into data left mostly untouched for years.

This presentation offers some case studies of student projects built on “dormant” data, explaining how students are trained to analyze, contextualize, visualize, and make sense of data they had no involvement in collecting. It discusses best practices for providing this data, as well as a scaffolded approach to helping students become conversant in techniques for understanding and working with data. It suggests a “toolkit” of off-the-shelf platforms that are affordable and easy for students to grasp, and shows how one tool can build on another until even novice students are able to create full-fledged, sophisticated digital humanities projects in the space of a semester.

For those who have collected data they wish to share with students, this presentation offers suggestions for documenting, packaging, and contextualizing research data so that it is not only technically sound but also in a format that students can understand. It also offers a set of best practices for collaborating with students on a data-based research project, including methods for sharing, documenting, citing, and reusing data.

Fit for use: Repurposing research data, reconstructing provenance, and refining “clean” data

When it comes to teaching materials, data curation education may have become a victim of its own success: finding “dirty” data for classroom use is persistently difficult, in part because most published datasets have already been cleaned and curated! However, there are teachable moments to be found even when working with relatively “clean” data. Published data can be mined, re-structured, re-formatted, and otherwise curated for new uses. Additionally, the process of tracking down and contextualizing already published datasets can prove instructive in and of itself. The detective work needed to understand someone else’s project, and to reconstruct its provenance, can reveal unexpected idiosyncrasies in the dataset and thereby surface useful data wrangling skills to teach.

In this talk, we describe our work finding, curating, and reconstructing the provenance of “The Pettigrew Papers,” a published (and relatively clean) dataset we have used over two years of teaching week-long workshops on digital humanities data curation at the Digital Humanities at Oxford Summer School (DHOxSS). Thomas J. Pettigrew (also known as Thomas “Mummy” Pettigrew) was a Victorian surgeon, antiquarian, and Egyptologist. Pettigrew wrote several early texts on Egyptian mummies and was the founding treasurer of the British Archaeological Association. Though his correspondence is archived at Yale University’s Beinecke Rare Book and Manuscript Library, it came to our attention via a “data paper” published in the Journal of Open Archaeology Data (Moshenska, 2012) containing transcriptions of select letters.

In our first year teaching with the Pettigrew dataset, we wrote simple Python scripts to mine named entities from the letters and to pull header information about each letter into a spreadsheet for cleaning in OpenRefine. In hands-on sessions, we asked students to consider how they would clean and curate the dataset for new uses: what steps would need to be taken to create a network diagram of the entities named in his letters? To create a map of his correspondents? To create a timeline?
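As a rough illustration of the kind of script involved (not a reproduction of the workshop code), the sketch below assumes the transcribed letters are available as plain-text files and uses spaCy for entity recognition; the directory, file names, and output fields are hypothetical.

```python
# Sketch only: assumes plain-text transcriptions in pettigrew_letters/ and spaCy's
# small English model (install with: python -m spacy download en_core_web_sm).
# Not the original workshop scripts; paths and field names are illustrative.
import csv
from pathlib import Path

import spacy

nlp = spacy.load("en_core_web_sm")
letters_dir = Path("pettigrew_letters")  # hypothetical directory of transcribed letters

with open("letter_headers.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["file", "first_line", "people", "places"])
    for letter in sorted(letters_dir.glob("*.txt")):
        text = letter.read_text(encoding="utf-8")
        doc = nlp(text)
        # Collect named entities: people and place names mentioned in the letter
        people = sorted({ent.text for ent in doc.ents if ent.label_ == "PERSON"})
        places = sorted({ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")})
        first_line = text.strip().splitlines()[0] if text.strip() else ""
        writer.writerow([letter.name, first_line, "; ".join(people), "; ".join(places)])
```

A spreadsheet like the resulting CSV is the sort of artifact students then open in OpenRefine to cluster and reconcile variant spellings of names before attempting networks, maps, or timelines.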

In our second year teaching with this dataset, we spent more time reconstructing the original provenance of the Pettigrew letters themselves. In addition to the hands-on sessions from the first year, we asked students to consider how they might improve the metadata for the original data paper, and how they might resolve discrepancies between the data paper and the original finding aids created by the Beinecke (Ducharme, 2010). We additionally discussed how they might incorporate copies of Pettigrew’s publications available in the HathiTrust Digital Library into their work.

Overall, we found that asking students to clean and re-curate this already published dataset was only the starting point in our teaching; as we found further connections in digital libraries and archives beyond the original data paper, we identified subtle and important issues in the digital humanities and digital curation that guided our workshop design. In addition to teaching hands-on data cleaning and manipulation skills, we found it important to teach students a nuanced understanding of provenance: both in the sense of the archival “chain of custody” that contextualizes and validates a fonds, and in the sense of the processes that led to a dataset’s current form.

Training DH librarians: Using old DH projects to move forward

The DH Librarianship course at the University of Washington Information School investigates the multiple roles librarians play in DH scholarship and prepares students for a wide range of career options in libraries, DH centers, and academic departments. DH librarian roles range from fully credited collaborator with faculty to last-minute data cleaner, and everything in between. DH librarians also need to be prepared to support projects and research across the spectrum of disciplines, so we examine varying research methods across the humanities. The final project for the course asks students to locate an abandoned, or complete but aging, DH project and insert themselves as its librarian: they evaluate both the content and the technology of the project and suggest ways to improve or update each.

The data sets in these projects vary; examples include hand-collated quotations by a famous author on a fan site, census numbers provided in a project about London families in the 17th century, a list of shooting locations for a television show, metadata for photos of logging camps in the Pacific Northwest, multimedia elements in a documentary film, boxes of music programs from a summer camp, and quilting patterns.

Some projects also include the more typical (and larger) kind of data set, such as those drawn from HathiTrust or Google-generated Ngrams, but these have proven to be the exception. Working with small data sets means that cleaning doesn’t occupy much time during a 10-week quarter, and the data can be rearranged quickly to try out multiple visualization or data processing options.

Students evaluate the data sets early in the process; in nearly all cases, the data sets are either incomplete or inaccurate, and for some, updated data or other content is available. This is where the multidisciplinary expertise of librarians comes in, as MLIS students are trained in searching out valid information sources from multiple perspectives, whether that’s using vendor-supplied databases, open web search engines, or (gasp) sources in print or microform. This is also where students begin to see the striation of roles among true collaborators, project leaders, subject specialists, technical consultants, and data wranglers.

In reviewing aging or abandoned projects, students learn how easily the data, other content, and the functionality of a site or project can be lost. This gives them the perspective they need to start thinking about curation and preservation from the outset, rather than tackling those issues as add-ons if time allows.

Through these immersive projects, students have a chance to see DH through multiple lenses: those of a potential user, a collaborator, and a disciplinary specialist. They learn how to re-create and improve on a project and, in doing so, gain experience in evaluating and collecting data as well as with many of the platforms and software tools prominent in DH (some current, some defunct). Some students also reach out to the original site or project owner, and in a few cases have worked with that person to update the project, putting preservation or stabilizing features in place for future users.

Bibliography

Borgman, C. L. (2012). The conundrum of sharing research data. Journal of the Association for Information Science and Technology, 63(6): 1059-78.

Croxall, B. (2017). Digital humanities from scratch: A pedagogy-driven investigation of an in-copyright corpus. Digital Humanities 2017: Conference Abstracts. Montreal: McGill University, pp. 206-7.

Ducharme, D. J. (2010). Guide to the Pettigrew Papers OSB MSS 113. New Haven: Beinecke Rare Book and Manuscript Library. http://hdl.handle.net/10079/fa/beinecke.pettis1 (accessed 27 April 2018).

Moshenska, G. (2012). Selected correspondence from the papers of Thomas Pettigrew (1791-1865), surgeon and antiquary. Journal of Open Archaeology Data, 1(0). https://doi.org/10.5334/4f913ca0cbb89 (accessed 27 April 2018).

Rockwell, G., Day, S., Yu, J. and Engel, M. (2014). Burying dead projects: depositing the Globalization Compendium. Digital Humanities Quarterly, 8(2). http://www.digitalhumanities.org/dhq/vol/8/2/000179/000179.html (accessed 27 April 2018).

Sinclair, S. and Rockwell, G. (2012). Teaching computer-assisted text analysis. In Hirsch, B. (ed.), Digital Humanities Pedagogy: Practices, Principles, Politics. Open Book Publishers, pp. 241-54. https://www.openbookpublishers.com/product.php/161 (accessed 27 April 2018).