With the increasing availability of large corpora, humanities scholars gain opportunities to choose their material in a more data-driven way. How can we identify texts or text sections relevant to our research question if we abandon prior knowledge as a determining factor? In this paper, we explore the potential of semantic fields for finding text sections about a topic of interest.
We use the term “topic” in the sense of “subject of a text”. We do not refer to the topic modeling concept (Blei, 2012), but to the different subjects addressed in a text.
The use case we present is the identification of text sections about body and illness. This is motivated by our larger research project that focuses on health (Gaidys et al., 2017). As test data, we use extracts of about 7,000 words from the diverse research domains addressed in our research project.
Our guidelines for the manual annotation are available in German at http://doi.org/10.5281/zenodo.2634297.
We calculate the agreement between the annotators in order to estimate the difficulty of the task and the quality of our guidelines. For this calculation, we compare the annotations sentence by sentence. If any word in a sentence was annotated, we consider the whole sentence annotated. The objective of the task is to identify text sections, not phrases, so this abstraction is adequate. It also facilitates the comparison, as we do not need to deal with overlapping annotations. In terms of agreement, this is a rather tolerant approach. As a result, the agreement is relatively high, given the interpretative nature of the task. The chance-corrected scores range between 0.54 and 0.90, showing the varying difficulty of the texts and topics. Some of the disagreement could potentially be avoided by further refinement of the guidelines.
Table 1: Inter-annotator agreement, measured by kappa (Fleiss, 1971). (No mentions of body in the protocol.)
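The sentence-level abstraction described above can be made concrete in a few lines. The following is a minimal sketch with toy data, not our actual annotations; Cohen's kappa from scikit-learn stands in here for the chance-corrected measure reported in Table 1, since only two annotators are compared at a time:

```python
# Sentence-level agreement sketch. Assumption: each annotator's output is a
# set of sentence indices for sentences containing at least one annotated word.
from sklearn.metrics import cohen_kappa_score

def sentence_labels(annotated_sentence_ids, n_sentences):
    """Binary label per sentence: 1 if any word in it was annotated."""
    return [1 if i in annotated_sentence_ids else 0 for i in range(n_sentences)]

labels_a = sentence_labels({2, 3, 7}, n_sentences=10)  # annotator A (toy data)
labels_b = sentence_labels({2, 4, 7}, n_sentences=10)  # annotator B (toy data)
print(f"kappa = {cohen_kappa_score(labels_a, labels_b):.2f}")
```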
For the gold standard, the annotators and two other researchers resolved the discrepancies between the two annotations. Table 2 shows the absolute numbers of annotated sentences.
Table 2: Number of annotated sentences for the two topics
We generated semantic fields in the following three ways (Adelmann et al., 2019):
- based on GermaNet,
- based on the Integrated Authority File (GND, http://www.dnb.de/gnd, accessed April 29, 2019),
- based on word embeddings (WE); see the sketch below.
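To illustrate the word-embedding-based variant, the following sketch expands a small set of seed words into a semantic field by collecting their nearest neighbors. The model file, seed words, and cutoff are hypothetical placeholders; the actual procedure is described in Adelmann et al. (2019):

```python
# Hedged sketch: building a semantic field from word embeddings via nearest
# neighbors. The vector file and all parameters are assumptions, not the
# project's actual configuration.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("german_vectors.bin", binary=True)

def embedding_field(seed_words, topn=50):
    """Collect seeds plus the topn nearest neighbors of each seed."""
    field = set(seed_words)
    for seed in seed_words:
        for neighbor, _similarity in vectors.most_similar(seed, topn=topn):
            field.add(neighbor)
    return field

body_field = embedding_field(["Körper", "Hand", "Auge"])  # hypothetical seeds
```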
All words of the semantic fields were expanded to all possible inflected forms using SFST (Schmid, 2005) and the morphology model by Sennrich and Kunz (2014). The texts were automatically tagged with the three semantic fields using CATMA’s query function.
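Conceptually, the tagging step amounts to matching each sentence against the expanded lexicon of a semantic field. A minimal sketch, assuming the inflected forms have already been written to a word list (the file name and the whitespace tokenization are placeholders; in practice we use CATMA's query function):

```python
# Minimal matching sketch. Assumption: one inflected form per line in
# "body_field.txt"; whitespace splitting stands in for proper tokenization.
def load_field(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def tag_sentences(sentences, field):
    """Return the indices of sentences containing at least one field word."""
    return {i for i, sentence in enumerate(sentences)
            if any(token in field for token in sentence.split())}

body_field = load_field("body_field.txt")
tagged = tag_sentences(["Seine Hand zitterte .", "Es regnete ."], body_field)
```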
For the evaluation of the semantic field approach, we compare it sentence by sentence with the gold standard. Table 3 shows the results for precision, recall, and F1 scores. As can be expected for an annotation task involving much interpretation, fewer than half of the scores exceed 0.5. The GND semantic field has a better recall than precision, as it is very large, especially for illness. GermaNet and WE score higher on precision than recall. The combination of all three semantic fields results in a clear improvement for the semantic field of body.
Table 3: Results (scores above 0.5 in bold)
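The sentence-level comparison with the gold standard can be expressed directly as set operations over sentence indices. A short sketch with toy inputs, not our actual results:

```python
# Precision/recall/F1 over sets of sentence indices. Toy data only; the real
# evaluation uses the gold standard sentences counted in Table 2.
def evaluate(predicted, gold):
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = evaluate(predicted={1, 2, 5}, gold={2, 5, 6, 8})
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```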
For example, words like ‘Hand’ (hand) as a part of the body or ‘Virus’ (virus) as an indicator of illness were found both by the manual annotators and by our queries using the semantic fields. Our approach generates false negatives when the topics of interest are mentioned in an indirect way, as is frequently the case in literary texts, e.g. ‘zu ihren Füßen’ (at her feet). Additionally, our semantic fields consist of nouns only, so all other parts of speech were neglected. False positives were produced when words about body or illness were used metaphorically, as in ‘aus dem Auge verlieren’ (to lose track of) or mentions of ‘Herz’ (heart) in the context of ‘offenherziges Lächeln’ (open-hearted smile).
The identification of specific topics using existing or automatically generated semantic fields does not fully reproduce what human annotators do. Researchers relying on this method should be aware that they systematically lose texts with specific features, such as a more indirect style, which results in a biased corpus. The many false positives, on the other hand, can be removed manually. For scenarios with large corpora, an approach like this is still feasible. If we apply the method to identify units of text larger than sentences, the results might improve. We intend to conduct experiments to this end in the future.
A higher-level question is how we can adequately evaluate tasks involving a great deal of interpretation. There are many possible ways of operationalizing the topic body, and our annotation guidelines represent only one. We consider our contribution a rough first approximation to a solution of this issue.