With the increasing availability of large corpora, humanities scholars gain opportunities to choose their material in a more data-driven way. How can we identify texts or text sections relevant to our research question if we abandon prior knowledge as a determining factor? In this paper, we explore the potential of semantic fields for finding text sections about a topic of interest.
We use the term “topic” in the sense of “subject of a text”. We do not refer to the topic modeling concept (Blei, 2012), but to the different subjects addressed in a text.
The use case we present is the identification of text sections about body and illness. This is motivated by our larger research project that focuses on health (Gaidys et al., 2017). As test data, we use extracts of about 7,000 words from the diverse research domains addressed in our research project.
Our guidelines for the manual annotation are available in German at http://doi.org/10.5281/zenodo.2634297.
We calculate the agreement between the annotators in order to estimate the difficulty of the task and the quality of our guidelines. For this calculation, we compare the annotations sentence by sentence. If any word in a sentence was annotated, we consider the whole sentence annotated. The objective of the task is to identify text sections, not phrases, so this abstraction is adequate. It also facilitates the comparison, as we do not need to deal with overlapping annotations. In terms of agreement, this is a rather tolerant approach. As a result, the agreement is relatively high, given the interpretative nature of the task. The chance-corrected scores range between 0.54 and 0.90, showing the varying difficulty of the texts and topics. Some of the disagreement could potentially be avoided by further refinement of the guidelines.
Table 1: Inter-annotator agreement, measured by kappa (Fleiss, 1971). (No mentions of body in the protocol.)
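The sentence-level abstraction described above can be made concrete in a few lines. The following is a minimal sketch with toy data, not our actual annotations; Cohen's kappa from scikit-learn stands in here for the chance-corrected measure reported in Table 1, since only two annotators are compared at a time:

```python
# Sentence-level agreement sketch. Assumption: each annotator's output is a
# set of sentence indices for sentences containing at least one annotated word.
from sklearn.metrics import cohen_kappa_score

def sentence_labels(annotated_sentence_ids, n_sentences):
    """Binary label per sentence: 1 if any word in it was annotated."""
    return [1 if i in annotated_sentence_ids else 0 for i in range(n_sentences)]

labels_a = sentence_labels({2, 3, 7}, n_sentences=10)  # annotator A (toy data)
labels_b = sentence_labels({2, 4, 7}, n_sentences=10)  # annotator B (toy data)
print(f"kappa = {cohen_kappa_score(labels_a, labels_b):.2f}")
```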
For the gold standard, the annotators and two other researchers resolved the discrepancies between the two annotations. Table 2 shows the absolute numbers of annotated sentences.
Table 2: Number of annotated sentences for the two topics
We generated semantic fields in the following three ways (Adelmann et al., 2019):
- based on GermaNet,
- based on the Integrated Authority File (GND, http://www.dnb.de/gnd, accessed April 29, 2019),
- based on word embeddings (WE); see the sketch below.
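To illustrate the word-embedding-based variant, the following sketch expands a small set of seed words into a semantic field by collecting their nearest neighbors. The model file, seed words, and cutoff are hypothetical placeholders; the actual procedure is described in Adelmann et al. (2019):

```python
# Hedged sketch: building a semantic field from word embeddings via nearest
# neighbors. The vector file and all parameters are assumptions, not the
# project's actual configuration.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("german_vectors.bin", binary=True)

def embedding_field(seed_words, topn=50):
    """Collect seeds plus the topn nearest neighbors of each seed."""
    field = set(seed_words)
    for seed in seed_words:
        for neighbor, _similarity in vectors.most_similar(seed, topn=topn):
            field.add(neighbor)
    return field

body_field = embedding_field(["Körper", "Hand", "Auge"])  # hypothetical seeds
```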
All words of the semantic fields were expanded to all possible inflected forms using SFST (Schmid, 2005) and the morphology model by Sennrich and Kunz (2014). The texts were automatically tagged with the three semantic fields using CATMA’s query function.
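Conceptually, the tagging step amounts to matching each sentence against the expanded lexicon of a semantic field. A minimal sketch, assuming the inflected forms have already been written to a word list (the file name and the whitespace tokenization are placeholders; in practice we use CATMA's query function):

```python
# Minimal matching sketch. Assumption: one inflected form per line in
# "body_field.txt"; whitespace splitting stands in for proper tokenization.
def load_field(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def tag_sentences(sentences, field):
    """Return the indices of sentences containing at least one field word."""
    return {i for i, sentence in enumerate(sentences)
            if any(token in field for token in sentence.split())}

body_field = load_field("body_field.txt")
tagged = tag_sentences(["Seine Hand zitterte .", "Es regnete ."], body_field)
```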
For the evaluation of the semantic field approach, we compare it sentence by sentence with the gold standard. Table 3 shows the results for precision, recall, and F1 scores. As can be expected for an annotation task involving much interpretation, fewer than half of the scores exceed 0.5. The GND semantic field has a better recall than precision, as it is very large, especially for illness. GermaNet and WE score higher on precision than recall. The combination of all three semantic fields results in a clear improvement for the semantic field of body.
Table 3: Results (scores above 0.5 in bold)
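The sentence-level comparison with the gold standard can be expressed directly as set operations over sentence indices. A short sketch with toy inputs, not our actual results:

```python
# Precision/recall/F1 over sets of sentence indices. Toy data only; the real
# evaluation uses the gold standard sentences counted in Table 2.
def evaluate(predicted, gold):
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = evaluate(predicted={1, 2, 5}, gold={2, 5, 6, 8})
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```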
For example, words like ‘Hand’ (hand) as a part of the body or ‘Virus’ (virus) as an indicator of illness were found both by the manual annotators and by our queries using the semantic fields. Our approach generates false negatives when the topics of interest are mentioned in an indirect way, as is frequently the case in literary texts, e.g. ‘zu ihren Füßen’ (at her feet). Additionally, our semantic fields consist of nouns only, so all other parts of speech were neglected. False positives were produced when words about body or illness were used metaphorically, as in ‘aus dem Auge verlieren’ (to lose track of) or mentions of ‘Herz’ (heart) in the context of ‘offenherziges Lächeln’ (open-hearted smile).
The identification of specific topics using existing or automatically generated semantic fields does not fully reproduce what human annotators do. Researchers relying on this method should be aware that they systematically lose texts with specific features, such as a more indirect style, which results in a biased corpus. The many false positives, on the other hand, can be removed manually. For scenarios with large corpora, an approach like this is still feasible. If we apply the method to identify units of text larger than sentences, the results might improve. We intend to conduct experiments to this end in the future.
A higher-level question is how we can adequately evaluate tasks involving a great deal of interpretation. There are many possible ways of operationalizing the topic body, and our annotation guidelines represent only one. We consider our contribution a rough first approximation to a solution of this issue.