In digital humanities applications, tag clouds are often used as a means of distant reading (e.g., Beaven 2011, Koch et al. 2014, Hinrichs et al. 2015, Montague et al. 2015, John et al. 2016). By dissolving the structure of a text, that is, splitting it into words, the frequencies of the different words, denoted as tags in the following, can be determined. Typically, a tag cloud takes the N most frequent tags of a text corpus, maps frequency to font size and arranges the tags randomly on the screen, giving the observer a quick and intuitive summary of the textual content of the corpus. At least since Wordle (Viégas et al. 2009) was offered to the public for generating tag clouds on demand, they have enjoyed great popularity and are widely used.
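The basic construction described above can be illustrated with a minimal Python sketch. It is not the implementation of any of the tools discussed here: the function names, the tokenization by a simple regular expression, the linear frequency-to-size mapping and the point-size range are all assumptions made for illustration, and the sample sentence is invented.

```python
import re
from collections import Counter


def top_tags(text, n, stopwords=frozenset()):
    """Return the n most frequent tags of `text` as (tag, count) pairs.

    Tokenization is an assumption: lowercase runs of the letters a-z.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


def font_sizes(tags, min_pt=10.0, max_pt=48.0):
    """Map counts linearly to font sizes in [min_pt, max_pt] (assumed range)."""
    if not tags:
        return {}
    hi = max(count for _, count in tags)
    lo = min(count for _, count in tags)
    span = hi - lo
    return {
        tag: max_pt if span == 0 else min_pt + (count - lo) / span * (max_pt - min_pt)
        for tag, count in tags
    }


# Invented sample text, not a quotation.
sample = "the lord is my shepherd the lord is good and the lord is my light"
tags = top_tags(sample, 3, stopwords={"the", "is", "and"})
sizes = font_sizes(tags)
```

Because only the top N tags are kept, every tag below the cut-off disappears from the view, which is the source of one of the problems discussed in the next paragraph.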
Nevertheless, there are crucial theoretical problems in the design of tag clouds (Viégas et al. 2008) that call their benefit for text analysis tasks into question. Finding a single word without assistance is hard, long words attract more visual attention than short ones because they cover more space, and font sizes, and thus word frequencies, are difficult to compare. Furthermore, a tag cloud usually does not display all words in a corpus, so neglecting less frequent words can lead to misinterpretations. In this paper, we evaluate the value of several tag cloud visualization techniques that have been designed to support research tasks in various digital humanities scenarios. Continuing prior work of the co-authors (Jänicke et al. 2014, Jänicke and Geßner 2015), we base our analysis on the Bible, known as the most often read and researched book and thus well suited for evaluation purposes. We chose the King James Bible as the most influential English translation.
TagPies have been designed to compare the contexts of different words (Jänicke et al. 2018). For a set of M different terms, TagPies generate M+1 different tag sets: one set for the shared vocabulary and M sets for the non-shared vocabulary. TagVenn diagrams are an extension of TagPies that aim to compare co-occurrences more precisely, as all combinations of shared vocabulary are considered. Taking three terms a, b and c and their respective sets of co-occurring tags A, B and C as an example, TagVenn diagrams visualize the following tag sets: A\(B∪C), B\(A∪C), C\(A∪B), (A∩B)\C, (A∩C)\B, (B∩C)\A and A∩B∩C. The tags are arranged in a Venn diagram style, with a set of colors reflecting the intersection regions. As the human ability to distinguish colors is limited, a maximum of four texts (generating 15 sets) can be analyzed with TagVenn diagrams.
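The set construction can be sketched in a few lines of Python. This is only an illustration of the set algebra listed above, not the TagVenn implementation; the function name `venn_regions` and the placeholder tags are hypothetical.

```python
from itertools import combinations


def venn_regions(tag_sets):
    """Map every non-empty combination of terms to the tags shared by
    exactly those terms (and by none of the remaining terms).

    tag_sets: dict mapping a term to the set of its co-occurring tags.
    """
    terms = sorted(tag_sets)
    regions = {}
    for k in range(1, len(terms) + 1):
        for combo in combinations(terms, k):
            # Tags that co-occur with every term in the combination ...
            inside = set.intersection(*(tag_sets[t] for t in combo))
            # ... minus tags that also co-occur with any other term.
            outside = set().union(*(tag_sets[t] for t in terms if t not in combo))
            regions[combo] = inside - outside
    return regions


# Placeholder tag sets for three terms a, b and c.
example = {
    "a": {"x", "y", "z", "w"},
    "b": {"y", "z", "v"},
    "c": {"z", "w", "u"},
}
regions = venn_regions(example)
```

For M terms the function yields 2^M - 1 regions, which gives the 7 sets listed above for three terms and the 15 sets mentioned for four.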
A similarity in topics and writing style can be observed in the King James Bible: it is known that the books of the four Evangelists have a lot in common, with John differing the most from the others. This can clearly be seen by comparing John with Mark and Matthew, with the minimum number of occurrences set to four (Figure 1). John (JOH) and Mark (MAR) have significantly fewer words in common than Mark and Matthew (MAT). It is interesting to see that Matthew, rather than John as might be expected, has the largest number of words used only in that book. Especially worthy of further investigation through close reading are words like “truth” (27x) and “true” (13x), which are frequently used by John but fewer than four times by each of the other two Evangelists.
In the current version, the scholar has to set a minimum number of occurrences, which may not always give accurate results. The diagram might then show a word as occurring in only one book of the Bible, although other books also include that word, just with too few occurrences to pass the threshold. Setting a high minimum number of occurrences will not always be of interest to a researcher, since very frequently used words are not necessarily relevant for determining the content of a text. A workaround to avoid unwanted results could be to extend the stopword list on demand. In addition, an option to set a maximum number of occurrences would open up further research questions. Considering frequency classes might also yield interesting results and open up a promising field of research.
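The threshold effect can be made concrete with a small Python sketch. The function `visible_books`, its parameters and the book labels and counts are invented toy values for illustration. They are not taken from the tool or from the Bible.

```python
from collections import Counter


def visible_books(counts_by_book, tag, min_occ=1, max_occ=None):
    """Return the books in which `tag` passes the occurrence thresholds.

    max_occ is the hypothetical "maximum number of occurrences" option
    suggested in the text; None disables it.
    """
    result = set()
    for book, counts in counts_by_book.items():
        c = counts.get(tag, 0)
        if c >= min_occ and (max_occ is None or c <= max_occ):
            result.add(book)
    return result


# Toy counts for one tag in three placeholder books X, Y and Z.
counts = {
    "X": Counter(alpha=27, beta=5),
    "Y": Counter(alpha=3, beta=6),
    "Z": Counter(alpha=2),
}

exclusive_looking = visible_books(counts, "alpha", min_occ=4)
actual = visible_books(counts, "alpha", min_occ=1)
rare_only = visible_books(counts, "alpha", min_occ=1, max_occ=3)
```

With a minimum of four, the tag appears to be exclusive to book X, although it also occurs in Y and Z. An upper bound, in turn, isolates the books in which the tag is rare.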
By scanning the first tag cloud (Figure 2A), users get a quick and intuitive overview of the textual content that all chosen books have in common, such as “lord” and “jesus”, as well as of differences in topics. Looking at words that appear in only one of the chosen books shows that the word “esaus” (Figure 2B, orange) might be a spelling mistake in the book of Genesis or, if the apostrophe is filtered out, a normalization problem for the Biblical name “Esau”, which is used more often in different books of the Bible. Also interesting is the name “abram”, which is used in only one of the chosen books (Genesis, orange), as it is the rarely used former name of the very frequently mentioned “abraham”. Whether a word is relevant for a specific research question depends strongly on the books chosen for comparison. The third tag cloud visualization (Figure 2C) offers an interesting overview of tags that describe the seven subparts and the individual documents. Document-related tags, for example the city of “babylon”, which is mentioned only in the book of Jeremiah (green), as well as tags that characterize all loaded parts, such as “glory” or “prophet”, can be easily identified.
Choosing the word “righteousness” with a maximum distance of 6 across all books of the Bible shows how often this word and its variations reoccur in close proximity: the adjective “righteous” appears once two words apart, four times three words apart, once four words apart, twice five words apart and twice six words apart. The word itself is also repeated several times (e.g., six times at four words apart), as is its opposite “unrighteousness”. This case shows that the chosen word is frequently used in a repetitive, sermon-like style of writing, e.g., in Psalms and especially in the Sermon on the Mount in Matthew.
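Counts of this kind can be obtained with a simple windowed scan, sketched below in Python. This is an assumption about how such counts might be computed, not the method of the tool. In particular, distance is taken to be the difference in token positions, stopword removal is left out, and the token sequence is invented rather than quoted.

```python
from collections import Counter


def cooccurrences_by_distance(tokens, term, max_distance):
    """Count (tag, distance) pairs around every occurrence of `term`.

    Distance is the absolute difference in token positions (assumption).
    Repetitions of `term` itself are counted only looking forward, so
    each pair of occurrences contributes one count instead of two.
    """
    hits = Counter()
    for i, token in enumerate(tokens):
        if token != term:
            continue
        lo = max(0, i - max_distance)
        hi = min(len(tokens), i + max_distance + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            if tokens[j] == term and j < i:
                continue  # already counted from the earlier occurrence
            hits[(tokens[j], abs(j - i))] += 1
    return hits


# Invented token sequence for illustration.
tokens = "righteousness and peace righteous in righteousness".split()
near = cooccurrences_by_distance(tokens, "righteousness", 3)
wide = cooccurrences_by_distance(tokens, "righteousness", 6)
```

Grouping the resulting counter by distance gives, for each distance from 1 to the maximum, the tags and their frequencies at that distance.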
Increasing the minimum number of occurrences can massively change the result. Currently, stopwords are omitted, but researchers might want stopwords to be taken into consideration when trying to detect interesting chains of words such as sayings or proverbs. It could also be interesting to trace and visualize a single word reoccurring throughout a work, or to allow a more flexible span of words between two reoccurring words, which may indicate (e.g., rhetorical or stylistic) habits of the author.
The three case studies show that, despite the above-mentioned theoretical problems, tag clouds can be valuable tools to support different research inquiries if they are carefully designed. As opposed to Wordle, all presented visualizations use tag color and position to express a tag’s set relations. In all scenarios, it was important for the literary scholar to have interactive access to the underlying texts in order to examine emerging hypotheses. Furthermore, different parameter sets should be provided to generate varied views on the text in question.