Data Science & Digital Humanities: new collaborations, new opportunities and new complexities

Beatrice Alex, University of Edinburgh, b.alex@ed.ac.uk
Anne Alexander, University of Cambridge, raa43@cam.ac.uk
David Beavan, The Alan Turing Institute, dbeavan@turing.ac.uk
Eirini Goudarouli, The National Archives, Eirini.Goudarouli@nationalarchives.gov.uk
Leonardo Impett, Bibliotheca Hertziana - Max Planck Institute for Art History, impett@biblhertz.it
Barbara McGillivray, University of Cambridge, bm517@cam.ac.uk
Nora McGregor, British Library, nora.mcgregor@bl.uk
Mia Ridge, British Library, mia.ridge@bl.uk

2019-04-25T08:02:00Z

Paper: Panel / Multiple Paper Session
Keywords: data science; collaboration; interdisciplinary
Topics: teaching, pedagogy and curriculum; GLAM: galleries, libraries, archives, museums; interdisciplinary & community collaboration; English; computer science and informatics; artificial intelligence and machine learning; digital humanities (history, theory and methodology)
Panel overview

David Beavan, The Alan Turing Institute; Barbara McGillivray, University of Cambridge & The Alan Turing Institute

This panel highlights the emerging collaborations and opportunities between the fields of Digital Humanities (DH), Data Science (DS) and Artificial Intelligence (AI). It charts the enthusiastic progress of the Alan Turing Institute, the UK national institute for data science and artificial intelligence, as it engages with cultural heritage institutions and academics from arts, humanities and social sciences disciplines. We discuss the exciting work and lessons learned from various new activities across a number of high-profile institutions. As these initiatives push intellectual and computational boundaries, the panel considers the gains, benefits and complexities encountered. The panel then turns towards the future of such interdisciplinary working, considering how DS & DH collaborations can grow, with a view towards a manifesto. As Data Science grows globally, this panel session will stimulate new discussion and direction, to help ensure that the fields grow together, that arts & humanities remain a strong focus of DS & AI, and that DH methods and practices continue to benefit from new developments in DS, which will enable future research avenues and questions.

This panel has been enabled by the establishment two years ago of a new DS & DH Special Interest Group (SIG), with all co-organisers forming part of this panel. The group brings together 40+ practitioners, researchers and organisations from academia, the commercial sector and cultural heritage institutions, all with an interest in exploring the benefits to DH from DS & AI and vice versa. This includes new work in areas such as advanced statistics, algorithms, natural language processing, machine learning and artificial intelligence, coupled with big data, high performance computing and high throughput computing techniques. At the same time, the SIG has created new directions and prospects for the institute in humanities scholarship, highlighting the unique challenges this brings, such as the greater role of human interpretation, a more iterative research process, the vast potential of data sources, the variety of content modalities (text, audio, image, video etc.) and the social importance of research findings. For more information about the group, see https://www.turing.ac.uk/research/interest-groups/data-science-and-digital-humanities

The panel is structured as a series of short papers (c. 10 minutes each), introduced and facilitated by one of the co-authors, all offering different perspectives and personal reflections on the intersection between DH, DS & AI. Starting with access to source material, the first paper will discuss the future of archives, focusing not only on the opportunities digitisation can bring, but also on how AI can enhance the role of archives, archivists and of course their users. The next paper moves to advanced Machine Learning experiments as pedagogical tools, how these can help us understand the creative process in a digital realm and what the experiments can tell us about ourselves. The infrastructure to support DS & DH work is essential, and the third paper draws on extensive experience from the cultural heritage sector on building collaborations and the complexities of data available (or not). The rise of these advanced digital methods and techniques has led to the need for advanced training for those working and providing access in the cultural heritage sectors, as discussed in the next presentation. A more personal reflection on the importance of interdisciplinary work in the DH, DS & AI space follows, with advice on what works and what can be improved. The final paper ends on a high, with the focus on the future, towards a manifesto of Data Science and Digital Humanities. There will be ample time (c. 20-30 minutes) to engage the panel and the audience through a moderated discussion to close the session.

What is the role of Data Science and Digital Humanities in the Future Archive?

Eirini Goudarouli, The National Archives

Trust has always been central to archives. However, the intangible record is fundamentally changing the landscape as well as the role of archivists and archival institutions. Undoubtedly, the emergence of new-generation technologies is rapidly leading to an epistemological shift in archival science or, to put it in Thomas Kuhn’s words, to a scientific revolution, moving from a relatively settled scientific framework to an urgent need for profound change to its principles, methods and practices.

For example, new technologies such as Snapchat, Google Docs, neural networks, blockchains, hashing algorithms, cryptography and the cloud have profoundly altered the nature of archives by disrupting how information is created, recorded, captured, encoded, curated, shared, made available and used. These shifts require fundamentally new capabilities and approaches to capturing, preserving, contextualising and presenting an increasingly digital public record. Therefore, the archivist’s relationship to emergent technologies needs to become multi-layered. We need to understand the digital landscape, and the changing nature of how society creates and shares records in the light of new-generation technologies, and be willing to apply these advanced technologies in our archival response to these changes. This sparks a new era for archives, as today’s archivists must become equipped with emergent technologies as their own tools of the trade.

As the nature of records and archives evolves more quickly and the digital contests long-standing archival practices, trust comes to the forefront of the discussion. One of the major questions here is how archives retain the legitimacy they confer on the digital evidence they capture, preserve, contextualise and present. Archives and collection-holding institutions should rethink the nature of the record, and our archival practices around it, in the light of the digital, and combine practical considerations with explorative research into infrastructure, methodology, tools, techniques and user requirements, drawing on innovation across cultural heritage, academia and relevant industries. As we move towards a future fully functioning digital archive, we need to embed new-generation technologies in our recordkeeping practices to help us manage our rights and responsibilities as we go about capturing, preserving, contextualising and presenting digital records.

This paper will suggest that, in an era of AI-assisted recordkeeping, the need to demonstrate the adaptability, value and sustainability of archives has never been more acute. In this context, it will discuss the conceptual and epistemological challenges relating to trust and openness in rapidly evolving digital archives. The paper will also discuss the role of Data Science and Digital Humanities within the archival context and will argue that research plays a central role not only to inform but also to innovate around these challenges, helping to define future directions and lead to the shaping of the future archive.

Ways of machine seeing: experiments in critical pedagogy

Anne Alexander, University of Cambridge; Leonardo Impett, Bibliotheca Hertziana - Max Planck Institute for Art History

This paper explores the theoretical, methodological and practical challenges of developing Machine Seeing experiments as pedagogical tools, using as a case study a set of experiments created for graduate workshops between 2016 and 2019. These workshops aimed to facilitate interdisciplinary encounters between humanities scholars and computer vision researchers, and were framed by engagement with John Berger's seminal work Ways of Seeing (Berger 1972), which we see as offering critiques of visual ideology. In the spirit of Berger’s original documentary, our design experiments aim to appropriate image-machines (Clark 2008), in his case colour television and in ours Machine Learning systems, and turn them into critical pedagogical tools for the critique and re-understanding of historical and contemporary (digital) image cultures. The object of study is not merely digital or digitally-created images, but the ways in which computation informs our ways of seeing.

In attempting to combine hermeneutic criticism (of technology) with theoretically-engaged software development - not by alternating between them, but by doing each within the other - our workshop model is in the spirit of Critical Technical Practice (Agre 1997). It is precisely the difference of (and between) computer vision systems which makes them ideal tools for the critique of visual ideology. The object of this critique is not only contemporary digital image-culture, but historical (pre-computational) and machine-generated (post-human) images. We see this critique as fundamental to understanding creative practice in an age of computational automation, in order to develop methods of enquiry which render visible hidden social and technical dimensions of image creation and circulation, such as the production histories of commonly-used image training datasets, the mechanics of learning models (cost & optimisation functions), and the impact of factors such as personalisation in web search.

Machine visual perception is explored through several performative experiments, developed in the course of the workshops by ‘live-coding’, with technical experiments and critical discussion developing in parallel. In one experiment, Generative Adversarial Networks, twin pairs of image-generating and image-discriminating networks, are used to create infinite series of images from certain classes of dataset: Google image search results for different professions, high-rent and low-rent photographs from rental websites, or classical paintings. In another experiment, object detection networks semantically identify segments of images, and live-swap the resulting crops.

Our experiments encouraged workshop participants to see Machine Learning systems simultaneously as lenses and mirrors. In other words, ML systems are not simply devices through which we can view the world; we can make them show us something about ourselves. Our experiments thus attempted to reveal the ways of knowing encoded in ML systems through a kind of Verfremdungseffekt where the strangeness of machine visual perception causes viewers to question aspects of the composition and context of images which they previously took for granted. Like Berger’s experiments discussing Caravaggio with school children, we utilised the GAN as a surrogate for a human ‘naive observer’ in order to encourage workshop participants to temporarily ‘unlearn’ implicit aspects of their own image culture in order to facilitate deeper human understanding.

In search of the sweet spot: infrastructure at the intersection of cultural heritage and data science

Mia Ridge, British Library

This paper explores some of the challenges and paradoxes in the application of data science methods to cultural heritage collections. It draws on long experience in the cultural heritage sector, predating but broadly aligned with the 'OpenGLAM' and 'Collections as Data' movements. Experiences that have shaped this thinking include providing open cultural data for computational use; creating APIs for catalogue and interpretive records; running hackathons; helping cultural organisations think through the preparation of 'collections as data'; and supervising undergraduate and MSc projects for students of computer science.

The opportunities are many. Cultural heritage institutions (aka GLAMs: galleries, libraries, archives and museums) hold diverse historical, scientific and creative works (images, printed and manuscript works, objects, audio or video) that could be turned into some form of digital 'data' for use in data science and digital humanities research. GLAM staff have expert knowledge about the collections and their value to researchers. Data scientists bring rigour, specialist expertise and skills, and a fresh perspective to the study of cultural heritage collections.

While the quest to publish cultural heritage records and digital surrogates for use in data science is relatively new, the barriers within cultural organisations to creating suitable infrastructure with others are historically numerous. They include different expectations about the pace and urgency of work, different levels of technical expertise, resourcing and infrastructure, and different goals. They may even include different expectations about what 'data' is. Metadata drawn from GLAM catalogues is the most readily available and shared data, but it is rarely complete and often untidy and inconsistent (being the work of decades or centuries and many hands over that time). It is also a far cry from the datasets rich with images or transcribed text that data scientists might expect.

Copyright, data protection and commercial licensing can limit access to digitised materials (though this varies greatly). 'Orphaned works', where the rights holder cannot be traced in order to license the use of in-copyright works, mean that up to 40% of some collections, particularly sound or video collections, are unavailable for risk-free use (2012).

While GLAMs have experimented with APIs, downloadable datasets and SPARQL endpoints, they rarely have the resources or institutional will to maintain and refresh these indefinitely. Records may be available through multi-national aggregators such as Europeana, DPLA, or national aggregators, but as aggregation often requires that metadata is mapped to the lowest common denominator, their value for research may be limited.

The area of overlap between 'computationally interesting problems' and 'solutions useful for GLAMs' may be smaller than expected to date, but collaboration between cultural institutions and data scientists on shared projects in the 'sweet spot' - where new data science methods are explored to enhance the discoverability of collections - may provide a way forward. Sector-wide collaborations like the International Image Interoperability Framework (IIIF, https://iiif.io/) provide modern models for lightweight but powerful standards. Pilot projects with students or others can help test the usability of collection data and infrastructure while exploring the applicability of emerging technologies and methods. It is early days for these collaborations, but the future is bright.
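To illustrate why IIIF counts as lightweight, the sketch below builds a request URL following the path template of the IIIF Image API (identifier/region/size/rotation/quality.format). The helper name `iiif_image_url`, the server address and the identifier are hypothetical, chosen for illustration only; the `max` size keyword is assumed to be supported by the target server (older servers may expect `full` instead).

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build a IIIF Image API request URL from its path parameters."""
    # Percent-encode the identifier so that characters such as "/"
    # inside it are not mistaken for path separators.
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Hypothetical server and identifier, for illustration only:
# request the top-left 1024x1024 region, scaled to 512 pixels wide.
url = iiif_image_url("https://example.org/iiif", "ms-001/page-1",
                     region="0,0,1024,1024", size="512,")
print(url)
```

Because the whole request is expressed in the URL, a pilot project can fetch crops and thumbnails from any compliant institution without collection-specific code.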

Computing in Cultural Heritage: adventures in training GLAM staff to support digital scholarship

Nora McGregor, British Library

The cultural heritage sector of galleries, libraries, archives and museums (GLAMs) shares a fundamental purpose: to preserve and provide access to digital and digitised culture, and to support and stimulate the innovative use of these extensive digital collections and data in research. Professionals across the sector will often have come to their role many years ago with deep domain expertise in a particular academic subject, yet now find themselves with increased responsibility for assisting with the design and delivery of complex digital projects (Parry, 2018), without a foundation in computing to empower them.

Though the GLAM sector is facing a broadening digital skills gap as rapid technological transformations impact nearly every role in the sector, there is much progress and innovation to report in digital skills development for supporting digital scholarship. From deploying simple scripts to make everyday tasks easier, through advising on the building of new digital systems & services to support collaborative, computational and data-driven research, to large-scale collaborative (Data Science and Digital Humanities) projects, today’s cultural heritage professionals require, and most importantly are actively seeking, a grounding in computational approaches to navigate this new landscape confidently.

Through a diverse mixture of approaches, such as bespoke course development, guest lectures, network building, reading groups and hack & yacks, the British Library is finding a variety of innovative ways to provide its staff with the space and opportunity to explore all that digital content and new technologies have to offer in the research domain today (British Library, 2018). This paper will report on these latest initiatives, including a newly funded project to develop a Post-Graduate Certificate in Computing for Cultural Heritage (British Library, 2019), relaying experiences particularly from the British Library, as well as from other GLAM institutions currently investing in upskilling their staff to support data science and the digital humanities. The paper will conclude with a discussion of matters of equality, diversity and inclusion, and report on efforts to identify and mitigate barriers for staff in taking up such digital skills training.

Benefits and challenges of interdisciplinary Data Science and Digital Humanities research

Beatrice Alex, University of Edinburgh

Data science (DS) and digital humanities (DH) collaborations can be extremely successful if the participants involved are willing to find common ground and value each other's research and contributions. Speaking from experience, they often take time to develop. The following section summarizes lessons learnt first hand by one of our co-authors in a number of interdisciplinary projects; some insights overlap with those shared by, for example, Siemens et al. (2016).

Computer scientists tend to want to develop and evaluate algorithms or come up with new and interesting ways to visualize information. This requires data and, in some cases, manually annotated data. Humanists and social scientists tend to want to prove or disprove hypotheses or study specific questions or queries using a particular dataset. Running projects that successfully combine these goals requires willingness and compromise on each side. It is important to be clear about differences at the start and discuss how area-specific and joint dissemination will be handled. Project publications with many co-authors are something to embrace but they also need to be better recognized.

It is sometimes worth running a pilot to gain common understanding and knowledge of what is possible computationally. This can help to attract further funding to be able to scale up research in a larger project. In practice, that is not necessarily how it works as one often gets involved in last-minute grant proposals and needs to make quick decisions about what is feasible computationally, which is far from ideal.

Funding councils need to make more grants for interdisciplinary work available (with longer preparation times) both nationally and internationally. Small network grants are not sufficient. Leading on from them, we need more funding calls like Digging into Data and the Trans-Atlantic Platform allowing participants from different countries to collaborate. To drive research and innovation and to encourage interdisciplinary work, such as that between DS & DH, national research funding bodies like UKRI must also provide similar opportunities on a national level.

Once a project goes ahead, a lot of time is spent on data preparation. Often the necessary datasets are assumed to be available, but complexities in data access, data being spread across different collections or released in different formats, and data quality are underestimated. For example, in the Trading Consequences and Palimpsest projects several large text collections were processed (Klein et al., 2014; Loxley et al., 2018). Converting them into one common format took time. Even with common guidelines being advocated (e.g. TEI or MARC), in reality most collections come with their own format and metadata, or no metadata at all. Those who have found themselves doing data wrangling will know that one is rarely able to re-use an existing conversion script. However, this initial, tedious preparation step is necessary and the experience gained doing it helps to get going faster each time.
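The conversion step described above can be pictured as one small adapter per collection, each mapping that collection's own layout onto a shared record type. The following is a minimal sketch under stated assumptions, not the pipeline used in Trading Consequences or Palimpsest: the `Record` fields, the CSV column names and the XML element names are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical common format shared by every collection."""
    source: str
    title: str
    year: str  # kept as text; an empty string means "unknown"
    text: str


def from_csv(source, raw):
    """Adapter for a collection delivered as CSV with its own column names."""
    for row in csv.DictReader(io.StringIO(raw)):
        yield Record(source, row["Title"].strip(),
                     (row.get("Date") or "").strip(), row["Body"].strip())


def from_xml(source, raw):
    """Adapter for a collection delivered as simple XML with no date metadata."""
    for doc in ET.fromstring(raw).iter("doc"):
        yield Record(source, doc.findtext("head", "").strip(), "",
                     doc.findtext("p", "").strip())


# Two toy collections in different formats (invented sample data).
csv_raw = "Title,Date,Body\nTrade report,1870,Cinchona exports rose.\n"
xml_raw = ("<collection><doc><head>Letter</head>"
           "<p>Arrived in Edinburgh.</p></doc></collection>")

records = [*from_csv("collection-a", csv_raw),
           *from_xml("collection-b", xml_raw)]
for record in records:
    print(record)
```

Each new collection still needs its own adapter, which is why existing scripts are rarely reusable as they stand, but everything downstream of `Record` can be shared across collections.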

Finally, computational methods are not always perfectly accurate. We need to inform collaborators about what a technology (for example, a text mining pipeline) is capable of and how it can be used to assist scholarship, rather than replace it. When making a tool available, it is also important to report on the data it was originally developed for or trained on, to avoid disappointment when it is used on data from a different domain. Sharing and being transparent about a technology can lead to positive outcomes and new interesting collaborations.

Towards a Manifesto for Data Science and Digital Humanities

Barbara McGillivray, University of Cambridge & The Alan Turing Institute; David Beavan, The Alan Turing Institute

This paper takes the first steps towards a manifesto for the field at the intersection between Data Science and Digital Humanities. Over the past few years there has been an increasing level of activity in the area of computational humanities, quantitative humanities, and humanities computing, as attested by several initiatives, groups (cf. e.g. Computational Humanities group at Leipzig https://ch.uni-leipzig.de/about/, the Computational Humanities committee https://www.ehumanities.nl/computational-humanities/), workshops (cf. e.g. Computational Humanities 2014 https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=14301, COMHUM 2018 http://wp.unil.ch/llist/en/event/comhum2018/), and publications (Biemann et al., 2014; Nyhan and Flinn, 2016; Jenset and McGillivray, 2017). Research in this area has focussed on developing new computational techniques to analyse and model humanities data.

Digital Humanities work has partially tackled these questions, but its scope is broader, and has historically concentrated primarily on the creation of digital resources, editions and tools, as well as the application of existing computational methods to humanities data (cf. Digital Humanities manifesto 2.0), rather than on new computational challenges.

We believe that the development of cutting-edge computational research that aims to promote and answer new research questions, including methodological ones, in the Humanities deserves a specific and coordinated reflection. Such effort is needed to shape the scope of the field, suggest a methodological framework for it, and highlight strategic priorities to drive it forward.

We propose to make this contribution in the form of a collaboratively-written “manifesto”. The document will open with a scoping section, which defines the types of challenges addressed by the field in question, such as: how do we automate and scale up tasks in Humanities research? How can we train machines to perform creative tasks? Further, the manifesto will briefly describe the current landscape, from the point of view of existing methods and approaches, projects, and institutional initiatives. Next, we will review the challenges and opportunities that characterize the research of this emerging field, to highlight areas of progress. Finally, we will outline a future vision, providing a thought-leadership piece that suggests methodological frameworks and interdisciplinary models of collaboration to support the field, help it grow, and facilitate its broader dissemination. We will also lay the groundwork for an educational strategy to train the current and next generations of scholars and practitioners to operate in and interact with this field, and reflect on the funding routes appropriate to support research in this field.

DH2019 is the premier Digital Humanities conference worldwide, with many opinion leaders expected to attend. Therefore, we look forward to presenting an initial outline of the manifesto in this ideal venue, gathering feedback and stimulating a discussion amongst the panel and the audience.

Bibliography

Agre, P. (1997). Computation and Human Experience. Cambridge: Cambridge University Press.

Berger, J. (1972). Ways of Seeing. London: Penguin Books, 2008.

Biemann, C., Crane, G., Fellbaum, C. and Mehler, A. (eds.) (2014). Computational Humanities - bridging the gap between Computer Science and Digital Humanities. Report from Dagstuhl Seminar 14301. https://cpb-us-w2.wpmucdn.com/u.osu.edu/dist/4/27964/files/2016/01/DagstuhlSeminarFinalReport-2a7n3h7.pdf (accessed 14 March 2019)

British Library (2018). Digital Scholarship Staff Training Programme. https://www.bl.uk/projects/digital-scholarship-training-programme (accessed 29 April 2019)

British Library (2019). Computing for Cultural Heritage. https://www.bl.uk/projects/computingculturalheritage (accessed 29 April 2019)

Clark, T. J. (2008). “Art History in the Age of Image Machines”. In Can’t Remember. London: Phaidon Press.

Jenset, G. and McGillivray, B. (2017). Quantitative Historical Linguistics. A corpus framework. Oxford: Oxford University Press.

Klein, E., Alex, B., Grover, C., Tobin, R., Coates, C., Clifford, J., Quigley, A., Hinrichs, U., Reid, J., Osborne, N. and Fieldhouse, I. (2014). Digging Into Data White Paper: Trading Consequences, March 2014. http://tradingconsequences.blogs.edina.ac.uk/files/2014/03/DiggingintoDataWhitePaper-final.pdf

Loxley, J., Alex, B., Anderson, M., Hinrichs, U., Grover, C., Thomson, T., Harris-Birtill, D., Quigley, A. and Oberlander, J. (2018). 'Multiplicity embarrasses the eye': The digital mapping of literary Edinburgh. In Gregory, I., Debats, D. and Lafreniere, D. (eds.), Routledge Handbook of Spatial History. https://www.routledgehandbooks.com/doi/10.4324/9781315099781-35

Nyhan, J. and Flinn, A. (2016). Computation and the Humanities: Towards an Oral History of Digital Humanities. Springer.

Parry, R., Eikhof, D. R., Barnes, S. A. and Kispeter, E. (2018). Mapping the Museum Digital Skills Ecosystem - Phase One Report. https://lra.le.ac.uk/bitstream/2381/41572/2/One%20by%20One_Phase1_Report.pdf (accessed 29 April 2019)

Several authors. Digital Humanities Manifesto 2.0. www.humanitiesblast.com/manifesto/Manifesto_V2.pdf (accessed 14 March 2019)

Siemens, L. and INKE Research Group (2016). “‘Faster Alone, Further Together’: Reflections on INKE’s Year Six.” Scholarly and Research Communication, 7.2, 8 pages. http://src-online.ca/index.php/src/article/view/250/479