Faceted Search for Discovering Software Odijk Jan Utrecht University, Netherlands, The j.odijk@uu.nl 2014-12-19T13:50:00Z Name, Institution
Street City Country Name

Converted from a Word document

Paper Long Paper Metadata for software faceted search corpus and text analysis history and historiography metadata natural language processing speech processing linguistics English
Introduction

Enabling the easy discovery of resources is an important service for Digital Humanities scholars. In CLARIN, the Virtual Language Observatory (VLO) serves this purpose, but it is currently mostly suited for the discovery of data. Discovering software is not so easy in the current VLO. In order to address this issue, we present a proposal for faceted search in metadata for software, which is based on a CLARIN Component Metadata Infrastructure (CMDI) profile for the description of software that enables discovery of the software and formal documentation of aspects of the software. We have tested the profile and the faceted search based on this profile by making metadata for over 80 pieces of software, and by creating an implementation of the faceted search. It is available on the web.

http://portal.clarin.nl/clariah-tools-fs

We propose to add this faceted search to the VLO, and show how metadata curation software, combined with provided metadata curation files, can curate existing metadata descriptions for software using other profiles to make them suited for such faceted search (implemented for 280 other pieces of software, also available on the web

http://portal.clarin.nl/clariah-tools-fs-global

).

Metadata Profile ClarinSoftwareDescription

The ClarinSoftwareDescription (CSD) profile

http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1342181139640/xsd

enables one to describe information about software in accordance with the CMDI metadata framework (Broeder et al., 2012). The profile has been set up in such a way that it enables (1) the description of properties that support discovery of the resource, and (2) the description of properties for documenting the resource, in as formal a manner as possible.

Since the focus of this paper is on the faceted search for software, we only briefly describe some aspects of the profile.

The profile consists of the CMDI components Generalinfo, SoftwareFunction, SoftwareImplementation, Access, ResourceDocumentation, SoftwareDevelopment, TechnicalInfo, Service and LRS.

The SoftwareFunction component enables one to describe the function of the software in terms of the closed vocabulary elements tool category , tool tasks , research phase(s) (for which it is most relevant), research domains and, for the linguistics domain, relevant linguistic subdisciplines for which it was originally developed.

The SoftwareImplementation component enables one to describe information for users on the implementation and installation of the software. The most important components are for the description of the interface , the input and the output of the software.

The Service component (CLARIN-NL Web Service description) is intended for describing properties of web services. It is compatible with the CLARIN CMDI core model for Web Service description version 1.0.2.

The LRS component is intended for the description of the properties of a particular task for the CLARIN Language Resource SwitchBoard (CLRS, (Zinn, 2016)). It is our viewpoint that specifications for an application for inclusion in the CLRS registry should be derivable from the metadata for this application. We devised a script to turn a CSD-compatible metadata record that contains an LRS component into the format required for the CLRS and tested it successfully with the Frog web service and application (Van den Bosch et al., 2007).

Semantics

Many of the profile’s components, elements and their possible values have a semantic definition by a link to an entry in the CLARIN Concept Registry (CCR, (Schuurman et al., 2016)). For the ones that were lacking we created definitions and provided other relevant information required for inclusion into the CCR. These are currently being evaluated by the CCR coordinators for inclusion in the CCR. After our submission to the CCR, we made some new modifications to the profile, so there are new elements and values for which the semantics does not yet exist.

Metadata Descriptions using the CSD profile

We have described more than 80 software resources with the CSD profile. Describing these software resources resulted in various improvements of earlier versions of the profile. Most descriptions started from the information contained in an unformalized software overview. The information from this overview was semi-automatically converted to CMDI metadata in accordance with the CSD profile. The resulting descriptions were further extended and then submitted to the original developers and CLARIN Centres that host the resources for corrections and/or additions.

Faceted Search

A major purpose of metadata is to facilitate the discovery of resources. An important instrument for this purpose in CLARIN is the Virtual Language Observatory (VLO, ( Van Uytvanck, 2014)). The VLO offers faceted search for resources through their metadata, but its faceted search is fully tuned to the discovery of data . For this reason, we defined a new faceted search, specifically tuned to discovery of software . This faceted search offers search facets and display facets:

Search Facets LifeCycleStatus, ResearchPhase, toolTask, researchDomain, linguisticsSubject, inputLanguage, applicationType, NationalProject, CLARINCentre, input modality, licence

Display Facets name, title, version, inputMimetype, outputMimetype, outputLanguage, Country, Description, ResourceProxy, AccessContact, ProjectContact, CreatorContact, Documentation, Publication, sourcecodeURI, Project, MDSelfLink, OriginalLocation

Furthermore, all metadata profiles for the description of software must be able to provide the values for the facets. That is the case to a large extent, though some metadata curation is needed in some cases and existing values must be mapped to a restricted vocabulary for use in the faceted search.

Curation of existing metadata for software

The basic idea is as follows: we create a new standardised metadata record for all software descriptions, in principle each time a record is harvested. This metadata record contains the components and elements that are required for the faceted search as defined above. The record is constructed from the original CMDI record for the resource, combined with the data for this resource contained in a curation file, by a script. The curation file can be used to add information that was lacking or only present in an unformalised way, and it can be used to map existing values to other values, e.g. to restrict them to a specific closed vocabulary. We report on our experiments with such a curation file for the WebLichtWebService profile, since it was most needed and most complex for this profile. Over 280 WebLicht Web Services can now be found with the faceted search.

Concluding Remarks

The faceted search is publicly available and in use by digital humanities researchers. We already received feedback from users for further improvements, which we hope to make in the course of 2019. We will also describe some problems we encountered, which we only briefly mention here: (1) definition of closed vocabularies (2) several technical problems related to the CLARIN Component Registry (3) lack of good CMDI metadata editors. Finally, we will identify some future work, in particular on deriving CLRS registry entries for CLAM-based applications and web services,

https://proycon.github.io/clam/

for extracting metadata information from independent initiatives such as codemeta.

https://codemeta.github.io/

References

Daan Broeder, Menzo Windhouwer, Dieter Van Uytvanck, Twan Goosen, and Thorsten Trippel . (2012). CMDI: A component metadata infrastructure. In Proceedings of the LREC workshop ‘Describing LRs with Metadata: Towards Flexibility and Interoperability in the Documentation of LR’. , pages 1–4, Istanbul, Turkey. European Language Resources Association (ELRA).

Davor Ostojic, Go Sugimoto, and Matej Ďurčo . (2017). The curation module and statistical analysis on VLO metadata quality. In Selected papers from the CLARIN Annual Conference 2016, Aix-en-Provence, 2628 October 2016, number 136 in Linköping Electronic Conference Proceedings, pages 90–101. Linköping University Electronic Press, Linköpings Universitet.

Ineke Schuurman, Menzo Windhouwer, Oddrun Ohren, and Daniel Zeman . (2016). CLARIN Concept Registry: The New Semantic Registry. In Koenraad De Smedt, editor, Selected Papers from the CLARIN Annual Conference 2015, October 1416, 2015, Wroclaw, Poland, number 123 in Linköping Electronic Conference Proceedings, pages 62–70, Linköping, Sweden. CLARIN, Linköping University Electronic Press. http://www.ep.liu.se/ecp/article.asp?issue=123&article=004 .

A. van den Bosch, G.J. Busser, W. Daelemans, and S. Canisius . (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch. In F. Van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste, editors, Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting , pages 99–114. Leuven, Belgium.

Dieter Van Uytvanck. (2014). How can I find resources using CLARIN? Presentation held at the Using CLARIN for Digital Research tutorial workshop at the 2014 Digital Humanities Conference , Lausanne, Switzerland. https://www.clarin.eu/sites/default/files/CLARIN-dvu-dh2014_ VLO.pdf , July.

Claus Zinn . (2016). The CLARIN language resource switchboard. https://www.clarin.eu/ sites/default/files/08%20-%20ZINN-Lg-Sw-Board.pdf . Presentation at the CLARIN 2016 Annual Conference.