Data mining digital libraries

The central theme for this workshop is data mining and the connection between metadata and data in the context of digital libraries. Digital resources and search engines raise several questions about the relationship between metadata and the data they describe. For example, what is the relationship between metadata keywords and classification categories (e.g. Dewey)? How should topics found by topic modeling algorithms be labelled? Given readily available search engine technology that ranks documents by relevance based on content words, is there any need at all for library classification systems like Dewey or UDC?

While the content of metadata may overlap with the content of the texts they describe, metadata typically contain information not found within the text, such as author, geolocation and time data. In addition, subject or topic words are typically drawn from carefully constructed thesauri, which act as language models tailored to specialized literary collections within different fields. The question is then how search engines may benefit from metadata that come with such a language model, and for what kind of library user.

In this workshop, we invite colleagues to discuss the application of various methods to digital library resources, including the structure of the metadata itself, as well as digital book collections. Many resources, like journals and new book titles, are available to libraries in digital form, and some libraries have also launched digitization programs to create digital libraries, using scanners and OCR technology.

Both the text data and the metadata of digital libraries can be scrutinized with data mining techniques, opening up the material for large-scale, quantitative analysis. This makes such collections highly relevant for Digital Humanities studies.

Background

The ongoing trend towards increased digitization in society poses numerous challenges at many levels, but also opens up vast opportunities within many fields, including the library sector.

At the National Library of Norway, a mass digitization project was initiated in 2006, with the goal of digitizing the entire collection of books, newspapers, movies, radio and television broadcasts, music etc. In sum, this covers everything published in the public domain in Norway in all media types, i.e. the entire cultural heritage of Norway. For books, the goal is to have the entire stock digitized by 2017. Thus far, some 435,000 of 450,000 to 500,000 books have been digitized. When all books and newspapers have been digitized, we estimate that our Norwegian text corpus will consist of some 80 to 100 billion tokens, which is large for a rather small language like Norwegian, with approximately 5 million speakers. In comparison, the Google Books corpus contains approximately 500 billion tokens for English.

The National Library cooperates with scholars of literary studies and linguistics in developing and applying methods of data mining to the digital collection. We develop services that make the content available for quantitative research, without challenging intellectual property rights. One such service is NB N-gram for Norwegian (see http://www.nb.no/sp_tjenester/beta/ngram_1/), comparable to Google Ngram Viewer for English and other languages.
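To illustrate the general idea behind an n-gram service, the following is a minimal Python sketch of how aggregated n-gram counts per publication year might be computed, so that only frequencies and not the full texts need to be exposed. It is not a description of how NB N-gram is implemented: the function names, the whitespace tokenization, the lowercasing and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_ngram_counts(docs, n=2):
    """Aggregate n-gram counts per publication year.

    `docs` is an iterable of (year, text) pairs. Whitespace tokenization
    and lowercasing are simplifying assumptions, not a real tokenizer.
    """
    counts = defaultdict(Counter)
    for year, text in docs:
        tokens = text.lower().split()
        counts[year].update(ngrams(tokens, n))
    return counts


def relative_frequency(counts, ngram, year):
    """Share of `ngram` among all n-grams counted for `year` (0.0 if none)."""
    year_counts = counts.get(year, Counter())
    total = sum(year_counts.values())
    return year_counts[ngram] / total if total else 0.0


# Hypothetical toy corpus, for illustration only.
docs = [
    (1900, "the library holds the book"),
    (1900, "the book is old"),
    (2000, "the digital library holds the digital book"),
]
counts = yearly_ngram_counts(docs, n=2)
print(relative_frequency(counts, ("the", "book"), 1900))
```

Because only the aggregated counts are published, a trend line for a phrase can be drawn across years without reproducing any running text from the underlying books.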

Workshop leaders

Lars G. Johnsen: Research librarian at the National Library of Norway, PhD in linguistics. Fields of interest: semantics, grammar, philosophy of language, probability theory and applications. Email: Lars.Johnsen@nb.no, Phone: +47 23 27 61 84

Arne Martinus Lindstad: Research librarian at the National Library of Norway, PhD in linguistics. Fields of interest: corpus linguistics, language change, comparative syntax, negation. Email: arne.lindstad@nb.no, Phone: +47 23 27 62 11

Magnus Breder Birkenes: Research librarian at the National Library of Norway, PhD in linguistics. Fields of interest: corpus linguistics, history and dialectology of the Germanic languages. Email: magnus.birkenes@nb.no, Phone: +47 23 27 60 54

Target audience

Librarians, research librarians, scholars of literary studies, corpus and computational linguists

Length and Format

Half day:

09.00 - 09.30 Introduction and opening discussion

09.30 - 10.30 Slot 1: Structure of metadata, Data mining and library classification systems

10.30 - 11.00 Coffee break

11.00 - 12.00 Slot 2: Metadata and modeling

12.00 - 12.30 Wrap-up and final discussion

Budget

Coffee and snacks for the coffee break (max. 50€)

Technical requirements

A projector for presentations. Internet connection. We will bring our own computers.

Call for papers (CFP)

Are you interested in automatic classification of documents and its implications for libraries? How may search engines (like Elasticsearch) benefit from library metadata? Do you have any experience with developing public/academic web services on top of large amounts of library data? If these questions appeal to you, this workshop may be of interest. The central theme for this workshop is data mining and the connection between metadata and data in the context of digital libraries.

We invite papers on topics such as:

The structure of subject headings and descriptors used in book classification (e.g. in building thesauri)

The relationship between topic words and library classification systems

The relationship between content words and topic words (of existing metadata, or as output from topic modeling algorithms)

Automatic classification of digital documents

Authorship attribution

Development of computational services for research and the general public

Legal issues arising with different data mining practices

Please send us an abstract of at most 500 words that fits within the context outlined above.

Program Committee

Oddrun Ohren (National Library of Norway)

Koenraad De Smedt (University of Bergen)

Anders Nøklestad (University of Oslo)

Elise Conradi (National Library of Norway)