TY - RPRT AU - van Gompel, M. PY - 2019 DA - 2019// TI - FoLiA: Format for Linguistic Annotation - Documentation and Reference Guide IS - Language and Speech Technology Technical Report Series 19-01 PB - Radboud University UR - https://folia.readthedocs.io/en/latest/ ID - FOLIA19 ER - TY - RPRT AU - van Gompel, M. PY - 2018 DA - 2018// TI - CLAM Documentation IS - Language and Speech Technology Technical Report Series 18-03 PB - Radboud University UR - https://clam.readthedocs.io/en/latest/ ID - CLAM18 ER - TY - RPRT AU - van der Sloot, K. AU - Hendrickx, I. AU - van Gompel, M. AU - van den Bosch, A. AU - Daelemans, W. PY - 2018 DA - 2018// TI - Frog, A Natural Language Processing Suite for Dutch. Reference Guide IS - Language and Speech Technology Technical Report Series 18-02 PB - Radboud University UR - https://frognlp.readthedocs.io/en/latest/ ID - FROG18 ER - TY - RPRT AU - van Gompel, M. AU - van der Sloot, K. AU - Hendrickx, I. AU - van den Bosch, A. PY - 2018 DA - 2018// TI - Ucto: Unicode Tokeniser. Reference Guide IS - Language and Speech Technology Technical Report Series 18-01 PB - Radboud University UR - https://ucto.readthedocs.io/en/latest/ ID - UCTO18 ER - TY - JOUR AU - Beeksma, Merijn AU - Van Gompel, Maarten AU - Kunneman, Florian AU - Onrust, Louis AU - Regnerus, Bouke AU - Vinke, Dennis AU - Brito, Eduardo AU - Bauckhage, Christian PY - 2018 DA - 2018/01/ TI - Detecting and correcting spelling errors in high-quality Dutch Wikipedia text JO - Computational Linguistics in the Netherlands Journal SP - 122 EP - 137 VL - 8 ID - CLIN28SHAREDTASK ER - TY - CHAP AU - van Gompel, M. AU - van der Sloot, K. AU - Reynaert, M. AU - van den Bosch, A.
PY - 2017 DA - 2017// TI - FoLiA in Practice: The Infrastructure of a Linguistic Annotation Format BT - CLARIN in the Low Countries SP - 71 EP - 82 PB - Ubiquity Press AB - We present an overview of the software and data infrastructure for FoLiA, a Format for Linguistic Annotation developed within the scope of the CLARIN-NL project and other projects. FoLiA aims to provide a single unified file format accommodating a wide variety of linguistic annotation types, preventing the proliferation of different formats for different annotation types. FoLiA is being developed in a bottom-up and practice-driven fashion. We have invested mainly in the creation of a rich infrastructure of tools that enable developers and end-users to work with the format. This work will present the current state of this infrastructure. SN - 9781911529248 UR - http://www.jstor.org/stable/j.ctv3t5qjk.13 ID - FOLIACLARINBOOK ER - TY - CHAP AU - Savary, Agata AU - Candito, Marie AU - Mititelu, Verginica Barbu AU - Bejček, Eduard AU - Cap, Fabienne AU - Čéplö, Slavomír AU - Cordeiro, Silvio Ricardo AU - Eryiğit, Gülşen AU - Giouli, Voula AU - van Gompel, Maarten AU - et al. PY - 2018 DA - 2018/Oct/ TI - PARSEME multilingual corpus of verbal multiword expressions BT - Multiword expressions at length and in depth: Extended papers from the MWE 2017 workshop SP - 87 EP - 147 PB - Language Science Press AB - Multiword expressions (MWEs) are known as a ‘pain in the neck’ due to their idiosyncratic behaviour. While some categories of MWEs have been largely studied, verbal MWEs (VMWEs) such as to take a walk, to break one’s heart or to turn off have been relatively rarely modelled. We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. In this joint effort carried out within a European research network we elaborated a universal terminology and annotation methodology for VMWEs.
Its main outcomes, available under open licenses, are unified annotation guidelines, and a corpus of over 5.4 million words and 62 thousand annotated VMWEs in 18 languages. SN - 978-3-96110-123-8 UR - https://doi.org/10.5281/zenodo.1471591 DO - 10.5281/zenodo.1471591 ID - PARSEME ER - TY - CHAP AU - Kemps-Snijders, Marc AU - Schuurman, Ineke AU - Daelemans, Walter AU - Demuynck, Kris AU - Desplanques, Brecht AU - Hoste, Véronique AU - Huijbregts, Marijn AU - Martens, Jean-Pierre AU - Paulussen, Hans AU - Pelemans, Joris AU - Reynaert, Martin AU - Vandeghinste, Vincent AU - van den Bosch, Antal AU - van den Heuvel, Henk AU - van Gompel, Maarten AU - van Noord, Gertjan AU - Wambacq, Patrick PY - 2017 DA - 2017// TI - TTNWW to the Rescue: No Need to Know How to Handle Tools and Resources BT - CLARIN in the Low Countries SP - 83 EP - 94 PB - Ubiquity Press AB - The idea behind the Flemish/Dutch CLARIN project TTNWW (‘TST Tools voor het Nederlands als Webservices in een Workflow’, or ‘NLP Tools for Dutch as Web services in a Workflow’) was that many end users of resources and tools offered by CLARIN will not know how to use them, just as they will not know where they are located. With respect to the location, the CLARIN policy is that the Human and Social Sciences (HSS) researcher does not need to know this as the infrastructure will take care of that: the only thing the user needs to do is to indicate SN - 9781911529248 UR - http://www.jstor.org/stable/j.ctv3t5qjk.14 ID - TTNWW ER - TY - RPRT AU - van Gompel, M. AU - Noordzij, J. AU - de Valk, R. AU - Scharnhorst, A.
PY - 2018 DA - 2018// TI - Guidelines for Software Quality PB - CLARIAH UR - https://github.com/CLARIAH/software-quality-guidelines/raw/v1.0/softwareguidelines.pdf ID - SOFTWAREQUALITY ER - TY - JOUR AU - van Gompel, Maarten AU - van den Bosch, Antal PY - 2016 DA - 2016// TI - Efficient n-gram, skipgram and flexgram modelling with Colibri Core JO - Journal of Open Research Software VL - 4 IS - 1 PB - Ubiquity Press AB - Counting n-grams lies at the core of any frequentist corpus analysis and is often considered a trivial matter. Going beyond consecutive n-grams to patterns such as skipgrams and flexgrams increases the demand for efficient solutions. The need to operate on big corpus data does so even more. Lossless compression and non-trivial algorithms are needed to lower the memory demands, yet retain good speed. Colibri Core is software for the efficient computation and querying of n-grams, skipgrams and flexgrams from corpus data. The resulting pattern models can be analysed and compared in various ways. The software offers a programming library for C++ and Python, as well as command-line tools. UR - https://openresearchsoftware.metajnl.com/articles/10.5334/jors.105/ UR - https://doi.org/10.5334/jors.105 DO - 10.5334/jors.105 ID - COLIBRICORE ER - TY - JOUR AU - van Gompel, Maarten AU - van den Bosch, Antal PY - 2016 DA - 2016// TI - The role of context information in L2 translation assistance JO - International Journal of Translation VL - 28 IS - 1-2 PB - Bahri Publications ID - COLIBRITAFINAL ER - TY - JOUR AU - Jiménez, R. C. AU - Kuzak, M. AU - Alhamdoosh, M. AU - Barker, M. AU - Batut, B. AU - Borg, M. AU - Capella-Gutierrez, S. AU - Chue Hong, N. AU - Cook, M. AU - Corpas, M. AU - Flannery, M. AU - Garcia, L. AU - Gelpí, J. L. AU - Gladman, S. AU - Goble, C. AU - González Ferreiro, M. AU - Gonzalez-Beltran, A. AU - Griffin, P. C. AU - Grüning, B. AU - Hagberg, J. AU - Holub, P. AU - Hooft, R. AU - Ison, J. AU - Katz, D. S. AU - Leskošek, B. 
AU - López Gómez, F. AU - Oliveira, L. J. AU - Mellor, D. AU - Mosbergen, R. AU - Mulder, N. AU - Perez-Riverol, Y. AU - Pergl, R. AU - Pichler, H. AU - Pope, B. AU - Sanz, F. AU - Schneider, M. V. AU - Stodden, V. AU - Suchecki, R. AU - Svobodová Vařeková, R. AU - Talvik, H. A. AU - Todorov, I. AU - Treloar, A. AU - Tyagi, S. AU - van Gompel, M. AU - Vaughan, D. AU - Via, A. AU - Wang, X. AU - Watson-Haigh, N. S. AU - Crouch, S. PY - 2017 DA - 2017// TI - Four simple recommendations to encourage best practices in research software JO - F1000Research VL - 6 IS - 876 AB - Scientific research relies on computer software, yet software is not always developed following practices that ensure its quality and sustainability. This manuscript does not aim to propose new software development best practices, but rather to provide simple recommendations that encourage the adoption of existing best practices. Software development best practices promote better quality software, and better quality software improves the reproducibility and reusability of research. These recommendations are designed around Open Source values, and provide practical suggestions that contribute to making research software and its source code more discoverable, reusable and transparent. This manuscript is aimed at developers, but also at organisations, projects, journals and funders that can increase the quality and sustainability of research software by encouraging the adoption of these recommendations. UR - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5490478/ UR - https://doi.org/10.12688/f1000research.11407.1 DO - 10.12688/f1000research.11407.1 ID - F1000RESEARCH ER - TY - CONF AU - Reynaert, M. AU - van Gompel, M. AU - van der Sloot, K. AU - van den Bosch, A. 
PY - 2015 DA - 2015// TI - PICCL: Philosophical Integrator of Computational and Corpus Libraries BT - Proceedings of CLARIN Annual Conference 2015 – Book of Abstracts PB - CLARIN ERIC UR - http://www.clarin.eu/sites/default/files/book%20of%20abstracts%202015.pdf ID - Reynaert+15 ER - TY - CONF AU - van Gompel, M. AU - van den Bosch, A. PY - 2014 DA - 2014/Jun/ TI - Translation Assistance by Translation of L1 Fragments in an L2 Context BT - Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) SP - 871 EP - 880 PB - Association for Computational Linguistics AB - In this paper we present new research in translation assistance. We describe a system capable of translating native language (L1) fragments to foreign language (L2) fragments in an L2 context. Practical applications of this research can be framed in the context of second language learning. The type of translation assistance system under investigation here encourages language learners to write in their target language while allowing them to fall back to their native language in case the correct word or expression is not known. These code switches are subsequently translated to L2 given the L2 context. We study the feasibility of exploiting cross-lingual context to obtain high-quality translation suggestions that improve over statistical language modelling and word-sense disambiguation baselines. A classification-based approach is presented that is indeed found to improve significantly over these baselines by making use of a contextual window spanning a small number of neighbouring words. UR - http://www.aclweb.org/anthology/P14-1082 ID - COLIBRITAPILOT ER - TY - JOUR AU - Maat, H. P. AU - Kraf, R. AU - van den Bosch, A. AU - Dekker, N. AU - van Gompel, M. AU - Kleijn, S. AU - Sanders, T. AU - van der Sloot, K. 
PY - 2014 DA - 2014// TI - T-Scan: a new tool for analyzing Dutch text JO - Computational Linguistics in the Netherlands Journal SP - 53 EP - 74 VL - 4 AB - T-Scan is a new tool for analyzing Dutch text. It aims at extracting text features that are theoretically interesting, in that they relate to genre and text complexity, as well as practically interesting, in that they enable users and text producers to make text-specific diagnoses. T-Scan derives its features from tools such as Frog and Alpino, and resources such as SoNaR, SUBTLEX-NL and Referentie Bestand Nederlands. This paper offers a qualitative discussion of a number of T-Scan features, based on a minimal demonstration corpus of six texts, three of them scientific articles and three of them drawn from a women’s magazine. We discuss features concerning lexical complexity, sentence complexity, referential cohesion and lexical diversity, lexical semantics and personal style. For all these domains we examine the construct validity as well as the reliability of a number of important features. We conclude that T-Scan offers a number of promising lexical and syntactic features, while the interpretation of referential cohesion/lexical diversity features and personal style features is less clear. Further developing the application and analyzing authentic text need to go hand in hand. SN - 2211-4009 UR - https://www.clinjournal.org/sites/clinjournal.org/files/05-PanderMaat-etal-CLIN2014.pdf ID - TSCAN ER - TY - CONF AU - van Gompel, M. AU - Reynaert, M. PY - 2014 DA - 2014// TI - CLAM: Quickly deploy NLP command-line tools on the web BT - Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations SP - 71 EP - 75 PB - Dublin City University and Association for Computational Linguistics AB - In this paper we present the software CLAM; the Computational Linguistics Application Mediator.
CLAM is a tool that allows you to quickly and transparently transform command-line NLP tools into fully-fledged RESTful webservices with which automated clients can communicate, as well as a generic webapplication interface for human end-users. UR - http://aclweb.org/anthology/C14-2016 ID - CLAMPAPER ER - TY - RPRT AU - van Gompel, M. PY - 2014 DA - 2014// TI - CLAM: Computational Linguistics Application Mediator PB - Radboud University Nijmegen UR - https://github.com/proycon/clam/raw/v2.3.6/docs/clam_manual.pdf ID - CLAMDOC ER - TY - RPRT AU - van Gompel, M. PY - 2014 DA - 2014// TI - FoLiA: Format for Linguistic Annotation. Documentation PB - Radboud University Nijmegen UR - https://github.com/proycon/folia/raw/v1.5.1.60/docs/folia.pdf ID - FOLIADOC ER - TY - CONF AU - van Gompel, M. AU - Hendrickx, I. AU - van den Bosch, A. AU - Lefever, E. AU - Hoste, V. PY - 2014 DA - 2014// TI - Semeval-2014 Task 5: L2 writing assistant BT - Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014) AB - We present a new cross-lingual task for SemEval concerning the translation of L1 fragments in an L2 context. The task is at the boundary of Cross-Lingual Word Sense Disambiguation and Machine Translation. It finds its application in the field of computer-assisted translation, particularly in the context of second language learning. Translating L1 fragments in an L2 context allows language learners when writing in a target language (L2) to fall back to their native language (L1) whenever they are uncertain of the right word or phrase. UR - http://aclweb.org/anthology/S14-2005 ID - SEMEVAL2014TASK5 ER - TY - JOUR AU - van Gompel, M. AU - van den Bosch, A. AU - Dykstra, A. PY - 2014 DA - 2014// TI - Oersetter: Frisian-Dutch statistical machine translation JO - Philologia Frisica anno 2012 SP - 287 EP - 296 PB - Fryske Akademy AB - In this paper we present a statistical machine translation (SMT) system for Frisian to Dutch and Dutch to Frisian. 
A parallel training corpus has been established, which has subsequently been used to automatically learn a phrase-based SMT model. The translation system is built around the open-source SMT software Moses. The resulting system, named Oersetter, is released as a website for human end users, as well as a web service for software to interact with. We here discuss the workings, setup and performance of our system, which to our knowledge is the very first Frisian-Dutch SMT system. UR - http://hdl.handle.net/2066/129749 ID - OERSETTER ER - TY - CONF AU - Beißwenger, M. AU - Chanier, T. AU - Chiari, I. AU - Ermakova, M. AU - v. Gompel, M. AU - Hendrickx, I. AU - Herold, A. AU - Heuvel, H. V. D. AU - Lemnitzer, L. AU - Storrer, A. AU - others PY - 2013 DA - 2013// TI - Computer-mediated communication in TEI: What lies ahead BT - The Linked TEI: Text Encoding in the Web. 2013 Annual Conference and Members’ Meeting of the TEI Consortium ID - CMCTEI ER - TY - JOUR AU - van Gompel, M. AU - Reynaert, M. PY - 2013 DA - 2013// TI - FoLiA: A practical XML Format for Linguistic Annotation - a descriptive and comparative study JO - Computational Linguistics in the Netherlands Journal VL - 3 AB - In this paper we present FoLiA, a Format for Linguistic Annotation, and conduct a comparative study with other annotation schemes, including the Linguistic Annotation Framework (LAF), the Text Encoding Initiative (TEI) and Text Corpus Format (TCF). An additional point of focus is the interoperability between FoLiA and metadata standards such as the Component MetaData Infrastructure (CMDI), as well as data category registries such as ISOcat. The aim of the paper is to present a clear image of the capabilities of FoLiA and how it relates to other formats. This should open discussion and aid users in their decision for a particular format.
FoLiA is a practically-oriented XML-based annotation format for the representation of language resources, explicitly supporting a wide variety of annotation types. It introduces a flexible and uniform paradigm and a representation independent of language or label set. It is designed to be highly expressive, generic, and formalised, whilst at the same time focussing on being as practical as possible to ease its adoption and implementation. The aspiration is to offer a generic format for storage, exchange, and machine-processing of linguistically annotated documents, preventing users as well as software tools from having to cope with a wide variety of different formats, which in the field regularly causes convertibility issues and proliferation of ad-hoc formats. FoLiA emerged from such a practical need in the context of Computational Linguistics in the Netherlands and Flanders. It has been successfully adopted by numerous projects within this community. FoLiA was developed in a bottom-up fashion, with special emphasis on software libraries and tools to handle it. UR - http://clinjournal.org/sites/clinjournal.org/files/05-vanGompel-Reynaert-CLIN2013.pdf ID - FOLIAPAPER ER - TY - CONF AU - van Gompel, M. AU - van den Bosch, A. PY - 2013 DA - 2013// TI - WSD2: parameter optimisation for memory-based cross-lingual word-sense disambiguation BT - Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), in conjunction with the Second Joint Conference on Lexical and Computational Semantics PB - New Brunswick, NJ: Association for Computational Linguistics AB - We present our system WSD2 which participated in the Cross-Lingual Word-Sense Disambiguation task for SemEval 2013 (Lefever and Hoste, 2013). The system closely resembles our winning system for the same task in SemEval 2010. It is based on k-nearest neighbour classifiers which map words with local and global context features onto their translation, i.e. their cross-lingual sense.
The system participated in the task for all five languages and obtained winning scores for four of them when asked to predict the best translation(s). We tested various configurations of our system, focusing on various levels of hyperparameter optimisation and feature selection. Our final results indicate that hyperparameter optimisation did not lead to the best results, indicating overfitting by our optimisation method in this aspect. Feature selection does have a modest positive impact. UR - https://www.aclweb.org/anthology/S13-2033 ID - WSD2 ER - TY - CONF AU - Reynaert, M. AU - Schuurman, I. AU - Hoste, V. AU - Oostdijk, N. AU - van Gompel, M. PY - 2012 DA - 2012// TI - Beyond SoNaR: towards the facilitation of large corpus building efforts BT - Proceedings of the Eighth International conference on Language Resources and Evaluation (LREC) VL - 8 AB - In this paper we report on the experiences gained in the recent construction of the SoNaR corpus, a 500 MW reference corpus of contemporary, written Dutch. It shows what can realistically be done within the confines of a project setting where there are limitations to the duration in time as well to the budget, employing current state-of-the-art tools, standards and best practices. By doing so we aim to pass on insights that may be beneficial for anyone considering to undertake an effort towards building a large, varied yet balanced corpus for use by the wider research community. Various issues are discussed that come into play while compiling a large corpus, including approaches to acquiring texts, the arrangement of IPR, the choice of text formats, and steps to be taken in the preprocessing of data from widely different origins. We describe FoLiA, a new XML format geared at rich linguistic annotations. We also explain the rationale behind the investment in the high-quality semi-automatic enrichment of a relatively small (1 MW) subset with very rich syntactic and semantic annotations. 
Finally, we present some ideas about future developments and the direction corpus development may take, such as setting up an integrated work flow between web services and the potential role for ISOcat. We list tips for potential corpus builders, tricks they may want to try and further recommendations regarding technical developments future corpus builders may wish to hope for. UR - http://www.lrec-conf.org/proceedings/lrec2012/pdf/748_Paper.pdf ID - SONAR ER - TY - RPRT AU - van Gompel, M. AU - van der Sloot, K. AU - van den Bosch, A. PY - 2012 DA - 2012// TI - Ucto: Unicode Tokeniser. Version 0.5.3. Reference Guide IS - ILK 12-05 PB - ILK Research Group, Tilburg University UR - https://github.com/LanguageMachines/ucto/raw/v0.14.1/docs/ucto_manual.pdf ID - UCTO12 ER - TY - CONF AU - Vossen, P. AU - Görög, A. AU - Laan, F. AU - van Gompel, M. AU - Izquierdo-Bevia, R. AU - van den Bosch, A. PY - 2011 DA - 2011// TI - DutchSemCor: building a semantically annotated corpus for Dutch BT - Electronic lexicography in the 21st century: New Applications for New Users: Proceedings of eLex 2011, Bled, 10-12 November 2011 AB - State of the art Word Sense Disambiguation (WSD) systems require large sense-tagged corpora along with lexical databases to reach satisfactory results. The number of English language resources for developed WSD increased in the past years, while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor project will deliver a Dutch corpus that is sense-tagged with senses from the Cornetto lexical database. Part of this corpus (circa 300K examples) is manually tagged. The remainder is automatically tagged using different WSD systems and validated by human annotators. The project uses existing corpora compiled in other projects; these are extended with Internet examples for word senses that are less frequent and do not (sufficiently) appear in the corpora.
We report on the status of the project and the evaluations of the WSD systems with the current training data. UR - https://repository.ubn.ru.nl/handle/2066/94383 ID - DUTCHSEMCOR ER - TY - CONF AU - van Gompel, M. PY - 2010 DA - 2010// TI - UvT-WSD1: A cross-lingual word sense disambiguation system BT - SemEval ’10: Proceedings of the 5th International Workshop on Semantic Evaluation SP - 238 EP - 241 PB - Association for Computational Linguistics CY - Morristown, NJ, USA KW - ilk, vici, dutchsemcor, wsd, semeval, cross-lingual, word sense disambiguation AB - This paper describes the Cross-Lingual Word Sense Disambiguation system UvT-WSD1, developed at Tilburg University, for participation in two SemEval-2 tasks: the Cross-Lingual Word Sense Disambiguation task and the Cross-Lingual Lexical Substitution task. The UvT-WSD1 system makes use of k-nearest neighbour classifiers, in the form of single-word experts for each target word to be disambiguated. These classifiers can be constructed using a variety of local and global context features, and these are mapped onto the translations, i.e. the senses, of the words. The system works for a given language-pair, either English-Dutch or English-Spanish in the current implementation, and takes a word-aligned parallel corpus as its input. UR - http://aclweb.org/anthology/S10-1053 ID - UVTWSD1 ER - TY - CONF AU - van Gompel, M. AU - van den Bosch, A. AU - Berck, P. ED - Forcada, M. ED - Way, A. PY - 2009 DA - 2009// TI - Extending memory-based machine translation to phrases BT - Proceedings of the Third Workshop on Example-Based Machine Translation SP - 79 EP - 86 CY - Dublin, Ireland KW - ilk, dutchsemcor, memory-based machine translation, vici, pbmbmt, mbmt AB - We present a phrase-based extension to memory-based machine translation. This form of example-based machine translation employs lazy-learning classifiers to translate fragments of the source sentence to fragments of the target sentence.
Source-side fragments consist of variable-length phrases in a local context of neighboring words, translated by the classifier to a target-language phrase. We compare three methods of phrase extraction, and present a new decoder that reassembles the translated fragments into one final translation. Results show that one of the proposed phrase-extraction methods—the one used in Moses—leads to a translation system that outperforms context-sensitive word-based approaches. The differences, however, are small, arguably because the word-based approaches already capture phrasal context implicitly due to their source-side and target-side context sensitivity. UR - https://ilk.uvt.nl/mbmt/pbmbmt/pbmbmt-dublin.pdf ID - PBMBMTPAPER ER - TY - THES AU - van Gompel, M. PY - 2009 DA - 2009// TI - Phrase-based Memory-based Machine Translation PB - Tilburg University CY - the Netherlands UR - https://proycon.anaproy.nl/pubs/pbmbmt_thesis.pdf ID - MASTERSTHESIS U1 - Masters thesis ER -