A Framework to Quantify the Signs of Abandonment in Online Digital Humanities Projects Meneses Luis Electronic Textual Cultures Laboratory - University of Victoria, Canada ldmm@uvic.ca Martin Jonathan King’s College London jonathan.d.martin@kcl.ac.uk Furuta Richard Center for the Study of Digital Libraries, Texas A&M University furuta@cse.tamu.edu Siemens Ray Electronic Textual Cultures Laboratory - University of Victoria, Canada siemens@uvic.ca 2014-12-19T13:50:00Z Name, Institution
Street City Country Name

Converted from a Word document

Paper Short Paper degradation abandonment preservation online digital humanities projects digital archives and digital libraries data mining / text mining English computer science and informatics sustainability and preservation
Introduction

As researchers in the digital humanities we have been successful in building online components for our work. However, we have failed in making it a priority to devise a plan to gracefully discard our online components once we no longer need them. Thus, many of the online projects in the digital humanities have an implied planned obsolesce —which means that they will degrade over time.

Previous work presented in Digital Humanities 2017 and 2018 has explored the abandonment, and the average lifespan, of online projects in the digital humanities using metadata from HTTP response headers (Meneses and Furuta, 2017) and contrasted how things have changed over the course of a year (Meneses et al., 2018). We believe that managing and characterizing the degradation of online digital humanities projects is a complex problem that demands further analysis because the methods for identifying change in the Web do not fully apply; and the end of life for a digital humanities project may or may not be indicated by updates in its content and tools.

In this sense “abandonment” is not necessarily a sufficient designation —as there are different nuances involved. We have seen many cases of successful projects in digital humanities that are shifting their focus from active development to data management (for example: http://cervantes.tamu.edu and http://botany.csdl.tamu.edu/). These are cases where a project’s online presence has not received updates for some time but its online tools are stable and continue to be accessed by its users. In this case, the lack of updates and new content is not a signal of abandonment. These are examples of why the rules for traditional resources do not fully apply and new metrics are needed to identify issues concerning online projects in the digital humanities.

In this abstract, we go one step further into exploring the collectively shared distinctive signs of abandonment to quantify the planned obsolesce of online digital humanities projects. For this purpose, we have created a framework that collectively quantifies their signs of abandonment. This study aims to answer three questions. First, can we systematically identify the signals of abandoned projects using computational methods? Second, can the degree of abandonment be quantified? And third, what features are more relevant than others when identifying instances of abandonment?

Methodology

A complete listing of research projects in the Digital Humanities does not exist. However, the Alliance of Digital Humanities Organizations publishes a Book of Abstracts after each Digital Humanities conference as a PDF. Each one of these volumes can be treated as a compendium of the research that is carried out in the field. To create a dataset, we downloaded the Books of Abstracts from 2006 to 2018. Then we proceeded to extract the text from these documents using Apache Tika (Apache Software Foundation, 2018) and parse the unique URLs for each Web resource using regular expressions.

Then we periodically created a set of WARC files (International Organization for Standardization, 2017) for each resource using Wget (Free Software Foundation, 2018). The WARC files are systematically processed and analyzed using Python (van Rossum, 1995) and Apache Spark (Apache Software Foundation, 2017) to create a hash that represents their contents —pinpointing changes over time— and to extract the analytics that we used in our statistical analysis. More specifically, our analysis has two parts that incorporate the retrieved HTTP response codes, number of redirects, a detailed examination of the contents and links returned by traversing the base node, external resources, HTTP headers and linked files. Figure 1 shows the workflow that we used in our framework to quantify the sings of abandonment.

Figure 1: Workflow to quantify sings of abandonment in Online Digital Humanities Projects

First, we carried out a preliminary classification of the websites into two groups depending on their correctness according to their HTTP response codes: valid (responses in the 200 and 300 range) and decayed (all other response codes). If a Web resource reports more than one redirect, we placed it in the decayed category. This is a preliminary classification because a Web resource could return an HTTP response code implying correctness while showing erroneous content (Meneses et al., 2012: 404) —justifying the second part of our analysis where we cluster the contents of each Web resource in the valid category. We perform the clustering using topic modeling and Term frequency–Inverse document frequency (Tf-Idf).

The textual contents and the links associated with shared resources are the most obvious feature for clustering. Previous work has shown that shared resources are the first to disappear from the Web (SalahEldeen and Nelson, 2012) —which we interpret as premature indications of degradation. To detect these early signs, we generated topic and term frequency models to examine the similarity among the documents in a given project (the contents of the base node and the metadata and the contents of the child nodes). We used Latent Dirichlet Allocation (LDA) to model the content of the text (Blei et al., 2003) and a simple Tf-Idf ranking function to measure and compare them. This ranking function is based on adding the Tf-Idf values for the documents linked to a Web resource, which were calculated using the terms from the topic modelling as a vocabulary. This combination metrics and techniques allow us to compare and assess the degree of change of online digital humanities projects over time.

Conclusions

In this this study we aim to computationally identify the indicators of the abandonment of digital humanities projects —a very specific domain— as well as quantify their degrees of neglect. It is important to highlight that not all projects are equal and thus require different levels of attention. Previous work in this area was based on the metadata from HTTP headers —emphasizing the need for a framework that utilizes robust metrics to identify the collectively shared indicators of degradation. We intend this study to be a step forward towards better preservation mechanisms and for adopting strategies for the planned obsolesce of digital humanities projects.

Bibliography Apache Software Foundation (2017). Apache Spark: Lightning-fast cluster computing http://spark.apache.org (accessed 11 April 2017). Apache Software Foundation (2018). Apache Tika – Apache Tika https://tika.apache.org/ (accessed 25 November 2018). Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent dirichlet allocation. The Journal of Machine Learning Research, 3: 993–1022. Free Software Foundation (2018). GNU Wget https://www.gnu.org/software/wget/ (accessed 25 November 2018). International Organization for Standardization (2017). ISO 28500:2017 WARC File Format http://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/06/80/68004.html (accessed 25 November 2018). Meneses, L. and Furuta, R. (2017). Shelf life: Identifying the Abandonment of Online Digital Humanities Projects Paper presented at the Digital Humanities 2017, Montreal, Canada. Meneses, L., Furuta, R. and Shipman, F. (2012). Identifying ‘Soft 404’ Error Pages: Analyzing the Lexical Signatures of Documents in Distributed Collections. In Zaphiris, P., Buchanan, G., Rasmussen, E. and Loizides, F. (eds), Theory and Practice of Digital Libraries, vol. 7489. (Lecture Notes in Computer Science). Springer Berlin Heidelberg, pp. 197–208 http://dx.doi.org/10.1007/978-3-642-33290-6_22 http://link.springer.com/chapter/10.1007%2F978-3-642-33290-6_22. Meneses, L., Martin, J., Furuta, R. and Siemens, R. (2018). Part Deux: Exploring the Signs of Abandonment of Online Digital Humanities Projects Paper presented at the Digital Humanities 2018, Mexico City. SalahEldeen, H. M. and Nelson, M. L. (2012). Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?. Theory and Practice of Digital Libraries. (Lecture Notes in Computer Science). Springer, Berlin, Heidelberg, pp. 125–37 doi:10.1007/978-3-642-33290-6_14. https://link.springer.com/chapter/10.1007/978-3-642-33290-6_14 (accessed 1 August 2018). van Rossum, G. (1995). Python Tutorial, Technical Report CS-R9526. Amsterdam: Centrum voor Wikunde en Informatica (CWI) https://ir.cwi.nl/pub/5007/05007D.pdf.