TY  - CONF
AU  - van Gompel, Maarten
AU  - Windhouwer, Menzo
ED  - Vandeghinste, Vincent
ED  - Kontino, Thalassia
PY  - 2025
DA  - 2025/08/
TI  - FAIR Tool Discovery
BT  - Selected papers from the CLARIN Annual Conference 2024
SP  - 141
EP  - 150
PB  - Linköping Electronic Conference Proceedings
AB  - We present the Tool Discovery pipeline, a core component of the CLARIAH infrastructure in the Netherlands. This pipeline harvests software metadata from the source, detects existing heterogeneous metadata formats already in use by software developers, and converts them to a single uniform representation based on schema.org and codemeta. The resulting data is then made available for further ingestion into other user-facing catalogue/portal systems.
UR  - https://doi.org/10.3384/ecp216.12
DO  - 10.3384/ecp216.12
ID  - FAIRTOOLDISCOVERY25
ER  - 
TY  - CONF
AU  - Lendvai, Piroska
AU  - van Gompel, Maarten
AU  - Jouravel, Anna
AU  - Renje, Elena
AU  - Reichel, Uwe
AU  - Rabus, Achim
AU  - Arnold, Eckhart
ED  - Calzolari, Nicoletta
ED  - Kan, Min-Yen
ED  - Hoste, Veronique
ED  - Lenci, Alessandro
ED  - Sakti, Sakriani
ED  - Xue, Nianwen
PY  - 2024
DA  - 2024/05/
TI  - A Workflow for HTR-Postprocessing, Labeling and Classifying Diachronic and Regional Variation in Pre-Modern Slavic Texts
BT  - Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
SP  - 2039
EP  - 2048
PB  - ELRA and ICCL
CY  - Torino, Italia
AB  - We describe ongoing work for developing a workflow for the applied use case of classifying diachronic and regional language variation in Pre-Modern Slavic texts. The data were obtained via handwritten text recognition (HTR) on medieval manuscripts and printings and partly by manual transcription. Our goal is to develop a workflow for such historical language data, covering HTR-postprocessing, annotating and classifying the digitized texts. We test and adapt existing language resources to fit the pipeline with low-barrier tooling, accessible for Humanists with limited experience in research data infrastructures, computational analysis or advanced methods of natural language processing (NLP). The workflow starts by addressing ground truth (GT) data creation for diagnosing and correcting HTR errors via string metrics and data-driven methods. On GT and on HTR data, we subsequently show classification results using transfer learning on sentence-level text snippets. Next, we report on our token-level data labeling efforts. Each step of the workflow is complemented with describing current limitations and our corresponding work in progress.
UR  - https://aclanthology.org/2024.lrec-main.184
ID  - lendvai-etal-2024-workflow-htr
ER  - 
TY  - RPRT
AU  - van Gompel, M.
PY  - 2019
DA  - 2019//
TI  - FoLiA: Format for Linguistic Annotation - Documentation and Reference Guide
IS  - Language and Speech Technology Technical Report Series 19-01
PB  - Radboud University
UR  - https://folia.readthedocs.io/en/latest/
ID  - FOLIA19
ER  - 
TY  - RPRT
AU  - van Gompel, M.
PY  - 2018
DA  - 2018//
TI  - CLAM Documentation
IS  - Language and Speech Technology Technical Report Series 18-03
PB  - Radboud University
UR  - https://clam.readthedocs.io/en/latest/
ID  - CLAM18
ER  - 
TY  - RPRT
AU  - van der Sloot, K.
AU  - Hendrickx, I.
AU  - van Gompel, M.
AU  - van den Bosch, A.
AU  - Daelemans, W.
PY  - 2018
DA  - 2018//
TI  - Frog, A Natural Language Processing Suite for Dutch. Reference Guide
IS  - Language and Speech Technology Technical Report Series 18-02
PB  - Radboud University
UR  - https://frognlp.readthedocs.io/en/latest/
ID  - FROG18
ER  - 
TY  - RPRT
AU  - van Gompel, M.
AU  - van der Sloot, K.
AU  - Hendrickx, I.
AU  - van den Bosch, A.
PY  - 2018
DA  - 2018//
TI  - Ucto: Unicode Tokeniser. Reference Guide
IS  - Language and Speech Technology Technical Report Series 18-01
PB  - Radboud University
UR  - https://ucto.readthedocs.io/en/latest/
ID  - UCTO18
ER  - 
TY  - JOUR
AU  - Beeksma, Merijn
AU  - van Gompel, Maarten
AU  - Kunneman, Florian
AU  - Onrust, Louis
AU  - Regnerus, Bouke
AU  - Vinke, Dennis
AU  - Brito, Eduardo
AU  - Bauckhage, Christian
PY  - 2018
DA  - 2018/01/
TI  - Detecting and correcting spelling errors in high-quality Dutch Wikipedia text
JO  - Computational Linguistics in the Netherlands Journal
SP  - 122
EP  - 137
VL  - 8
ID  - CLIN28SHAREDTASK
ER  - 
TY  - CHAP
AU  - van Gompel, M.
AU  - van der Sloot, K.
AU  - Reynaert, M.
AU  - van den Bosch, A.
PY  - 2017
DA  - 2017//
TI  - FoLiA in Practice: The Infrastructure of a Linguistic Annotation Format
BT  - CLARIN in the Low Countries
SP  - 71
EP  - 82
PB  - Ubiquity Press
AB  - We present an overview of the software and data infrastructure for FoLiA, a Format for Linguistic Annotation developed within the scope of the CLARIN-NL project and other projects. FoLiA aims to provide a single unified file format accommodating a wide variety of linguistic annotation types, preventing the proliferation of different formats for different annotation types. FoLiA is being developed in a bottom-up and practice-driven fashion. We have invested mainly in the creation of a rich infrastructure of tools that enable developers and end-users to work with the format. This work will present the current state of this infrastructure.
SN  - 9781911529248
UR  - http://www.jstor.org/stable/j.ctv3t5qjk.13
ID  - FOLIACLARINBOOK
ER  - 
TY  - CHAP
AU  - Savary, Agata
AU  - Candito, Marie
AU  - Mititelu, Verginica Barbu
AU  - Bejček, Eduard
AU  - Cap, Fabienne
AU  - Čéplö, Slavomír
AU  - Cordeiro, Silvio Ricardo
AU  - Eryiğit, Gülşen
AU  - Giouli, Voula
AU  - van Gompel, Maarten
AU  - et al.
PY  - 2018
DA  - 2018/10/
TI  - PARSEME multilingual corpus of verbal multiword expressions
BT  - Multiword expressions at length and in depth: Extended papers from the MWE 2017 workshop
SP  - 87
EP  - 147
PB  - Language Science Press
AB  - Multiword expressions (MWEs) are known as a ‘pain in the neck’ due to their idiosyncratic behaviour. While some categories of MWEs have been largely studied, verbal MWEs (VMWEs) such as to take a walk, to break one’s heart or to turn off have been relatively rarely modelled. We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. In this joint effort carried out within a European research network we elaborated a universal terminology and annotation methodology for VMWEs. Its main outcomes, available under open licenses, are unified annotation guidelines, and a corpus of over 5.4 million words and 62 thousand annotated VMWEs in 18 languages.
SN  - 978-3-96110-123-8
UR  - https://doi.org/10.5281/zenodo.1471591
DO  - 10.5281/zenodo.1471591
ID  - PARSEME
ER  - 
TY  - CHAP
AU  - Kemps-Snijders, Marc
AU  - Schuurman, Ineke
AU  - Daelemans, Walter
AU  - Demuynck, Kris
AU  - Desplanques, Brecht
AU  - Hoste, Véronique
AU  - Huijbregts, Marijn
AU  - Martens, Jean-Pierre
AU  - Paulussen, Hans
AU  - Pelemans, Joris
AU  - Reynaert, Martin
AU  - Vandeghinste, Vincent
AU  - van den Bosch, Antal
AU  - van den Heuvel, Henk
AU  - van Gompel, Maarten
AU  - van Noord, Gertjan
AU  - Wambacq, Patrick
PY  - 2017
DA  - 2017//
TI  - TTNWW to the Rescue: No Need to Know How to Handle Tools and Resources
BT  - CLARIN in the Low Countries
SP  - 83
EP  - 94
PB  - Ubiquity Press
AB  - The idea behind the Flemish/Dutch CLARIN project TTNWW (‘TST Tools voor het Nederlands als Webservices in een Workflow’, or ‘NLP Tools for Dutch as Web services in a Workflow’) was that many end users of resources and tools offered by CLARIN will not know how to use them, just as they will not know where they are located. With respect to the location, the CLARIN policy is that the Human and Social Sciences (HSS) researcher does not need to know this as the infrastructure will take care of that: the only thing the user needs to do is to indicate
SN  - 9781911529248
UR  - http://www.jstor.org/stable/j.ctv3t5qjk.14
ID  - TTNWW
ER  - 
TY  - RPRT
AU  - van Gompel, M.
AU  - Noordzij, J.
AU  - de Valk, R.
AU  - Scharnhorst, A.
PY  - 2018
DA  - 2018//
TI  - Guidelines for Software Quality
PB  - CLARIAH
UR  - https://github.com/CLARIAH/software-quality-guidelines/raw/v1.0/softwareguidelines.pdf
ID  - SOFTWAREQUALITY
ER  - 
TY  - JOUR
AU  - van Gompel, Maarten
AU  - van den Bosch, Antal
PY  - 2016
DA  - 2016//
TI  - Efficient n-gram, skipgram and flexgram modelling with Colibri Core
JO  - Journal of Open Research Software
VL  - 4
IS  - 1
PB  - Ubiquity Press
AB  - Counting n-grams lies at the core of any frequentist corpus analysis and is often considered a trivial matter. Going beyond consecutive n-grams to patterns such as skipgrams and flexgrams increases the demand for efficient solutions. The need to operate on big corpus data does so even more. Lossless compression and non-trivial algorithms are needed to lower the memory demands, yet retain good speed. Colibri Core is software for the efficient computation and querying of n-grams, skipgrams and flexgrams from corpus data. The resulting pattern models can be analysed and compared in various ways. The software offers a programming library for C++ and Python, as well as command-line tools.
UR  - https://openresearchsoftware.metajnl.com/articles/10.5334/jors.105/
UR  - https://doi.org/10.5334/jors.105
DO  - 10.5334/jors.105
ID  - COLIBRICORE
ER  - 
TY  - JOUR
AU  - van Gompel, Maarten
AU  - van den Bosch, Antal
PY  - 2016
DA  - 2016//
TI  - The role of context information in L2 translation assistance
JO  - International Journal of Translation
VL  - 28
IS  - 1-2
PB  - Bahri Publications
ID  - COLIBRITAFINAL
ER  - 
TY  - JOUR
AU  - Jiménez, R. C.
AU  - Kuzak, M.
AU  - Alhamdoosh, M.
AU  - Barker, M.
AU  - Batut, B.
AU  - Borg, M.
AU  - Capella-Gutierrez, S.
AU  - Chue Hong, N.
AU  - Cook, M.
AU  - Corpas, M.
AU  - Flannery, M.
AU  - Garcia, L.
AU  - Gelpí, J. L.
AU  - Gladman, S.
AU  - Goble, C.
AU  - González Ferreiro, M.
AU  - Gonzalez-Beltran, A.
AU  - Griffin, P. C.
AU  - Grüning, B.
AU  - Hagberg, J.
AU  - Holub, P.
AU  - Hooft, R.
AU  - Ison, J.
AU  - Katz, D. S.
AU  - Leskošek, B.
AU  - López Gómez, F.
AU  - Oliveira, L. J.
AU  - Mellor, D.
AU  - Mosbergen, R.
AU  - Mulder, N.
AU  - Perez-Riverol, Y.
AU  - Pergl, R.
AU  - Pichler, H.
AU  - Pope, B.
AU  - Sanz, F.
AU  - Schneider, M. V.
AU  - Stodden, V.
AU  - Suchecki, R.
AU  - Svobodová Vařeková, R.
AU  - Talvik, H. A.
AU  - Todorov, I.
AU  - Treloar, A.
AU  - Tyagi, S.
AU  - van Gompel, M.
AU  - Vaughan, D.
AU  - Via, A.
AU  - Wang, X.
AU  - Watson-Haigh, N. S.
AU  - Crouch, S.
PY  - 2017
DA  - 2017//
TI  - Four simple recommendations to encourage best practices in research software
JO  - F1000Research
VL  - 6
IS  - 876
AB  - Scientific research relies on computer software, yet software is not always developed following practices that ensure its quality and sustainability. This manuscript does not aim to propose new software development best practices, but rather to provide simple recommendations that encourage the adoption of existing best practices. Software development best practices promote better quality software, and better quality software improves the reproducibility and reusability of research. These recommendations are designed around Open Source values, and provide practical suggestions that contribute to making research software and its source code more discoverable, reusable and transparent. This manuscript is aimed at developers, but also at organisations, projects, journals and funders that can increase the quality and sustainability of research software by encouraging the adoption of these recommendations.
UR  - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5490478/
UR  - https://doi.org/10.12688/f1000research.11407.1
DO  - 10.12688/f1000research.11407.1
ID  - F1000RESEARCH
ER  - 
TY  - CONF
AU  - Reynaert, M.
AU  - van Gompel, M.
AU  - van der Sloot, K.
AU  - van den Bosch, A.
PY  - 2015
DA  - 2015//
TI  - PICCL: Philosophical Integrator of Computational and Corpus Libraries
BT  - Proceedings of CLARIN Annual Conference 2015 – Book of Abstracts
PB  - CLARIN ERIC
UR  - http://www.clarin.eu/sites/default/files/book%20of%20abstracts%202015.pdf
ID  - Reynaert+15
ER  - 
TY  - CONF
AU  - van Gompel, M.
AU  - van den Bosch, A.
PY  - 2014
DA  - 2014/06/
TI  - Translation Assistance by Translation of L1 Fragments in an L2 Context
BT  - Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
SP  - 871
EP  - 880
PB  - Association for Computational Linguistics
AB  - In this paper we present new research in translation assistance. We describe a system capable of translating native language (L1) fragments to foreign language (L2) fragments in an L2 context. Practical applications of this research can be framed in the context of second language learning. The type of translation assistance system under investigation here encourages language learners to write in their target language while allowing them to fall back to their native language in case the correct word or expression is not known. These code switches are subsequently translated to L2 given the L2 context. We study the feasibility of exploiting cross-lingual context to obtain high-quality translation suggestions that improve over statistical language modelling and word-sense disambiguation baselines. A classification-based approach is presented that is indeed found to improve significantly over these baselines by making use of a contextual window spanning a small number of neighbouring words.
UR  - http://www.aclweb.org/anthology/P14-1082
ID  - COLIBRITAPILOT
ER  - 
TY  - JOUR
AU  - Maat, H. P.
AU  - Kraf, R.
AU  - van den Bosch, A.
AU  - Dekker, N.
AU  - van Gompel, M.
AU  - Kleijn, S.
AU  - Sanders, T.
AU  - van der Sloot, K.
PY  - 2014
DA  - 2014//
TI  - T-Scan: a new tool for analyzing Dutch text
JO  - Computational Linguistics in the Netherlands Journal
SP  - 53
EP  - 74
VL  - 4
AB  - T-Scan is a new tool for analyzing Dutch text. It aims at extracting text features that are theoretically interesting, in that they relate to genre and text complexity, as well as practically interesting, in that they enable users and text producers to make text-specific diagnoses. T-Scan derives its features from tools such as Frog and Alpino, and resources such as SoNaR, SUBTLEX-NL and Referentie Bestand Nederlands. This paper offers a qualitative discussion of a number of T-Scan features, based on a minimal demonstration corpus of six texts, three of them scientific articles and three of them drawn from a women’s magazine. We discuss features concerning lexical complexity, sentence complexity, referential cohesion and lexical diversity, lexical semantics and personal style. For all these domains we examine the construct validity as well as the reliability of a number of important features. We conclude that T-Scan offers a number of promising lexical and syntactic features, while the interpretation of referential cohesion/lexical diversity features and personal style features is less clear. Further developing the application and analyzing authentic text need to go hand in hand.
SN  - 2211-4009
UR  - https://www.clinjournal.org/sites/clinjournal.org/files/05-PanderMaat-etal-CLIN2014.pdf
ID  - TSCAN
ER  - 
TY  - CONF
AU  - van Gompel, M.
AU  - Reynaert, M.
PY  - 2014
DA  - 2014//
TI  - CLAM: Quickly deploy NLP command-line tools on the web
BT  - Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations
SP  - 71
EP  - 75
PB  - Dublin City University and Association for Computational Linguistics
AB  - In this paper we present the software CLAM, the Computational Linguistics Application Mediator. CLAM is a tool that allows you to quickly and transparently transform command-line NLP tools into fully-fledged RESTful webservices with which automated clients can communicate, as well as a generic webapplication interface for human end-users.
UR  - http://aclweb.org/anthology/C14-2016
ID  - CLAMPAPER
ER  - 
TY  - RPRT
AU  - van Gompel, M.
PY  - 2014
DA  - 2014//
TI  - CLAM: Computational Linguistics Application Mediator
PB  - Radboud University Nijmegen
UR  - https://github.com/proycon/clam/raw/v2.3.6/docs/clam_manual.pdf
ID  - CLAMDOC
ER  - 
TY  - RPRT
AU  - van Gompel, M.
PY  - 2014
DA  - 2014//
TI  - FoLiA: Format for Linguistic Annotation. Documentation
PB  - Radboud University Nijmegen
UR  - https://github.com/proycon/folia/raw/v1.5.1.60/docs/folia.pdf
ID  - FOLIADOC
ER  - 
TY  - CONF
AU  - van Gompel, M.
AU  - Hendrickx, I.
AU  - van den Bosch, A.
AU  - Lefever, E.
AU  - Hoste, V.
PY  - 2014
DA  - 2014//
TI  - Semeval-2014 Task 5: L2 writing assistant
BT  - Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
AB  - We present a new cross-lingual task for SemEval concerning the translation of L1 fragments in an L2 context. The task is at the boundary of Cross-Lingual Word Sense Disambiguation and Machine Translation. It finds its application in the field of computer-assisted translation, particularly in the context of second language learning. Translating L1 fragments in an L2 context allows language learners when writing in a target language (L2) to fall back to their native language (L1) whenever they are uncertain of the right word or phrase.
UR  - http://aclweb.org/anthology/S14-2005
ID  - SEMEVAL2014TASK5
ER  - 
TY  - JOUR
AU  - van Gompel, M.
AU  - van den Bosch, A.
AU  - Dykstra, A.
PY  - 2014
DA  - 2014//
TI  - Oersetter: Frisian-Dutch statistical machine translation
JO  - Philologia Frisica anno 2012
SP  - 287
EP  - 296
PB  - Fryske Akademy
AB  - In this paper we present a statistical machine translation (SMT) system for Frisian to Dutch and Dutch to Frisian. A parallel training corpus has been established, which has subsequently been used to automatically learn a phrase-based SMT model. The translation system is built around the open-source SMT software Moses. The resulting system, named Oersetter, is released as a website for human end users, as well as a web service for software to interact with. We here discuss the workings, setup and performance of our system, which to our knowledge is the very first Frisian-Dutch SMT system.
UR  - http://hdl.handle.net/2066/129749
ID  - OERSETTER
ER  - 
TY  - CONF
AU  - Beißwenger, M.
AU  - Chanier, T.
AU  - Chiari, I.
AU  - Ermakova, M.
AU  - van Gompel, M.
AU  - Hendrickx, I.
AU  - Herold, A.
AU  - van den Heuvel, H.
AU  - Lemnitzer, L.
AU  - Storrer, A.
AU  - others
PY  - 2013
DA  - 2013//
TI  - Computer-mediated communication in TEI: What lies ahead
BT  - The Linked TEI: Text Encoding in the Web. 2013 Annual Conference and Members’ Meeting of the TEI Consortium
ID  - CMCTEI
ER  - 
TY  - JOUR
AU  - van Gompel, M.
AU  - Reynaert, M.
PY  - 2013
DA  - 2013//
TI  - FoLiA: A practical XML Format for Linguistic Annotation - a descriptive and comparative study
JO  - Computational Linguistics in the Netherlands Journal
VL  - 3
AB  - In this paper we present FoLiA, a Format for Linguistic Annotation, and conduct a comparative study with other annotation schemes, including the Linguistic Annotation Framework (LAF), the Text Encoding Initiative (TEI) and Text Corpus Format (TCF). An additional point of focus is the interoperability between FoLiA and metadata standards such as the Component MetaData Infrastructure (CMDI), as well as data category registries such as ISOcat. The aim of the paper is to present a clear image of the capabilities of FoLiA and how it relates to other formats. This should open discussion and aid users in their decision for a particular format. FoLiA is a practically-oriented XML-based annotation format for the representation of language resources, explicitly supporting a wide variety of annotation types. It introduces a flexible and uniform paradigm and a representation independent of language or label set. It is designed to be highly expressive, generic, and formalised, whilst at the same time focussing on being as practical as possible to ease its adoption and implementation. The aspiration is to offer a generic format for storage, exchange, and machine-processing of linguistically annotated documents, preventing users as well as software tools from having to cope with a wide variety of different formats, which in the field regularly causes convertibility issues and proliferation of ad-hoc formats. FoLiA emerged from such a practical need in the context of Computational Linguistics in the Netherlands and Flanders. It has been successfully adopted by numerous projects within this community. FoLiA was developed in a bottom-up fashion, with special emphasis on software libraries and tools to handle it.
UR  - http://clinjournal.org/sites/clinjournal.org/files/05-vanGompel-Reynaert-CLIN2013.pdf
ID  - FOLIAPAPER
ER  - 
TY  - CONF
AU  - van Gompel, M.
AU  - van den Bosch, A.
PY  - 2013
DA  - 2013//
TI  - WSD2: parameter optimisation for memory-based cross-lingual word-sense disambiguation
BT  - Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), in conjunction with the Second Joint Conference on Lexical and Computational Semantics
PB  - Association for Computational Linguistics
CY  - New Brunswick, NJ
AB  - We present our system WSD2 which participated in the Cross-Lingual Word-Sense Disambiguation task for SemEval 2013 (Lefever and Hoste, 2013). The system closely resembles our winning system for the same task in SemEval 2010. It is based on k-nearest neighbour classifiers which map words with local and global context features onto their translation, i.e. their cross-lingual sense. The system participated in the task for all five languages and obtained winning scores for four of them when asked to predict the best translation(s). We tested various configurations of our system, focusing on various levels of hyperparameter optimisation and feature selection. Our final results indicate that hyperparameter optimisation did not lead to the best results, indicating overfitting by our optimisation method in this aspect. Feature selection does have a modest positive impact.
UR  - https://www.aclweb.org/anthology/S13-2033
ID  - WSD2
ER  - 
TY  - CONF
AU  - Reynaert, M.
AU  - Schuurman, I.
AU  - Hoste, V.
AU  - Oostdijk, N.
AU  - van Gompel, M.
PY  - 2012
DA  - 2012//
TI  - Beyond SoNaR: towards the facilitation of large corpus building efforts
BT  - Proceedings of the Eighth International conference on Language Resources and Evaluation (LREC)
VL  - 8
AB  - In this paper we report on the experiences gained in the recent construction of the SoNaR corpus, a 500 MW reference corpus of contemporary, written Dutch. It shows what can realistically be done within the confines of a project setting where there are limitations to the duration in time as well as to the budget, employing current state-of-the-art tools, standards and best practices. By doing so we aim to pass on insights that may be beneficial for anyone considering to undertake an effort towards building a large, varied yet balanced corpus for use by the wider research community. Various issues are discussed that come into play while compiling a large corpus, including approaches to acquiring texts, the arrangement of IPR, the choice of text formats, and steps to be taken in the preprocessing of data from widely different origins. We describe FoLiA, a new XML format geared at rich linguistic annotations. We also explain the rationale behind the investment in the high-quality semi-automatic enrichment of a relatively small (1 MW) subset with very rich syntactic and semantic annotations. Finally, we present some ideas about future developments and the direction corpus development may take, such as setting up an integrated work flow between web services and the potential role for ISOcat. We list tips for potential corpus builders, tricks they may want to try and further recommendations regarding technical developments future corpus builders may wish to hope for.
UR  - http://www.lrec-conf.org/proceedings/lrec2012/pdf/748_Paper.pdf
ID  - SONAR
ER  - 
TY  - RPRT
AU  - van Gompel, M.
AU  - van der Sloot, K.
AU  - van den Bosch, A.
PY  - 2012
DA  - 2012//
TI  - Ucto: Unicode Tokeniser. Version 0.5.3. Reference Guide
IS  - ILK 12-05
PB  - ILK Research Group, Tilburg University
UR  - https://github.com/LanguageMachines/ucto/raw/v0.14.1/docs/ucto_manual.pdf
ID  - UCTO12
ER  - 
TY  - CONF
AU  - Vossen, P.
AU  - Görög, A.
AU  - Laan, F.
AU  - van Gompel, M.
AU  - Izquierdo-Bevia, R.
AU  - van den Bosch, A.
PY  - 2011
DA  - 2011//
TI  - DutchSemCor: building a semantically annotated corpus for Dutch
BT  - Electronic lexicography in the 21st century: New Applications for New Users: Proceedings of eLex 2011, Bled, 10-12 November 2011
AB  - State of the art Word Sense Disambiguation (WSD) systems require large sense-tagged corpora along with lexical databases to reach satisfactory results. The number of English language resources for developed WSD increased in the past years, while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor project will deliver a Dutch corpus that is sense-tagged with senses from the Cornetto lexical database. Part of this corpus (circa 300K examples) is manually tagged. The remainder is automatically tagged using different WSD systems and validated by human annotators. The project uses existing corpora compiled in other projects; these are extended with Internet examples for word senses that are less frequent and do not (sufficiently) appear in the corpora. We report on the status of the project and the evaluations of the WSD systems with the current training data.
UR  - https://repository.ubn.ru.nl/handle/2066/94383
ID  - DUTCHSEMCOR
ER  - 
TY  - CONF
AU  - van Gompel, M.
PY  - 2010
DA  - 2010//
TI  - UvT-WSD1: A cross-lingual word sense disambiguation system
BT  - SemEval ’10: Proceedings of the 5th International Workshop on Semantic Evaluation
SP  - 238
EP  - 241
PB  - Association for Computational Linguistics
CY  - Morristown, NJ, USA
KW  - ilk, vici, dutchsemcor, wsd, semeval, cross-lingual, word sense disambiguation
AB  - This paper describes the Cross-Lingual Word Sense Disambiguation system UvT-WSD1, developed at Tilburg University, for participation in two SemEval-2 tasks: the Cross-Lingual Word Sense Disambiguation task and the Cross-Lingual Lexical Substitution task. The UvT-WSD1 system makes use of k-nearest neighbour classifiers, in the form of single-word experts for each target word to be disambiguated. These classifiers can be constructed using a variety of local and global context features, and these are mapped onto the translations, i.e. the senses, of the words. The system works for a given language-pair, either English-Dutch or English-Spanish in the current implementation, and takes a word-aligned parallel corpus as its input.
UR  - http://aclweb.org/anthology/S10-1053
ID  - UVTWSD1
ER  - 
TY  - CONF
AU  - van Gompel, M.
AU  - van den Bosch, A.
AU  - Berck, P.
ED  - Forcada, M.
ED  - Way, A.
PY  - 2009
DA  - 2009//
TI  - Extending memory-based machine translation to phrases
BT  - Proceedings of the Third Workshop on Example-Based Machine Translation
SP  - 79
EP  - 86
CY  - Dublin, Ireland
KW  - ilk, dutchsemcor, memory-based machine translation, vici, pbmbmt, mbmt
AB  - We present a phrase-based extension to memory-based machine translation. This form of example-based machine translation employs lazy-learning classifiers to translate fragments of the source sentence to fragments of the target sentence. Source-side fragments consist of variable-length phrases in a local context of neighboring words, translated by the classifier to a target-language phrase. We compare three methods of phrase extraction, and present a new decoder that reassembles the translated fragments into one final translation. Results show that one of the proposed phrase-extraction methods (the one used in Moses) leads to a translation system that outperforms context-sensitive word-based approaches. The differences, however, are small, arguably because the word-based approaches already capture phrasal context implicitly due to their source-side and target-side context sensitivity.
UR  - https://ilk.uvt.nl/mbmt/pbmbmt/pbmbmt-dublin.pdf
ID  - PBMBMTPAPER
ER  - 
TY  - THES
AU  - van Gompel, M.
PY  - 2009
DA  - 2009//
TI  - Phrase-based Memory-based Machine Translation
PB  - Tilburg University
CY  - the Netherlands
UR  - https://proycon.anaproy.nl/pubs/pbmbmt_thesis.pdf
ID  - MASTERSTHESIS
U1  - Master's thesis
ER  -