The CHQL Query Language for Conceptual History Relying on Google Books

Introduction

The digitization of large time-labeled bibliographies has resulted in corpora such as the Google Ngram data set (Lin et al., 2012). Such corpora extremely accurately reflect how individual words are used over time. They are expected to reveal novel insights into the evolution of language and society, provided adequate analysis systems are available. In this context, developing a comprehensive query algebra that allows domain experts to formalize complex hypotheses would be a major contribution to successfully unlock this potential.

The case of conceptual history serves as our example from the humanities. In conceptual history, researchers examine the evolution of concepts represented by words such as “peace” or “freedom”. In exploring the history of a concept, scholars commonly make use of, but are not restricted to, word-usage frequencies, word contexts, sentiment analysis, how words refer and relate to and contrast with each other, or they look for word pairs or word families whose usage is correlated (Brunner et al., 2004; Ritter and Gründer, 1971). Consider our example: how the words “East” and “West” change from merely cardinal directions to politically charged concepts after 1945.

In this paper, we present a query algebra for empirical analyses of temporal text corpora, the Conceptual History Query Language (CHQL). A temporal text corpus in our sense is a set of words and word chains, i.e., ngrams, together with their usage frequency at various points of time. Our query language is meant to be useful for domain experts, i.e., be descriptive and complete (match all actual and potential hypotheses of conceptual history), and bear optimization potential to allow fast query processing on large data sets. We focus on an algebra inspired by the German tradition of Begriffsgeschichte (conceptual history), as exemplified by the work of Reinhart Koselleck (Olsen, 2012).

Related Work

Existing query algebras, like the one for the Structured Query Language (SQL), do not feature specific support for analyses of the kind we envisage. Other approaches from the literature, e.g., the Contextual Query Language (The Library of Congress, 2013), the Corpus Query Language (Jakubíček et al., 2010), or the ANNIS Query Language (Zeldes et al., 2009), have similar issues. The common relational algebra (Maier, 1983; Abiteboul et al., 1994), does not contain sufficiently specific operators, e.g., temporal or linguistic operators. Extensions exist to add temporal operators (Snodgrass, 1987; Snodgrass, 1995), but not linguistic operators. To query relations between words, there are special-purpose query languages. For example, SQWRL is a language to query an ontology (O'Connor and Das, 2009). Querying word relations, e.g., from an ontology, does not include all required linguistic relationships. Further, ontologies do not provide temporal information. SQWRL does not contain any temporal operator. All of these algebras have in common that they do not cover both linguistic and temporal operators required for research on conceptual history.

Related work in the digital humanities mainly consists of data processing and the analysis of text corpora (Warwick et al., 2012; Hai-Jew, 2017). Some frameworks focus on linguistic and reflective properties as well as their evolution such as (Hamilton et al., 2016a; Hamilton et al., 2016b; Prabhakaran et al., 2016; Englhardt et al., 2019). Respective systems cannot output the required information to conduct research on conceptual history in a comprehensive way. In addition, such systems do not provide a sufficiently abstract interface, a reason why experts are reluctant in using them (Hai-Jew, 2017).

Concept Types and Operators

This section shows in the abstract how the operators of CHQL allow searching for concept types. A formal definition of all of our operators is given in (Willkomm et al., 2018) and will be presented at DH2019.

Conceptual history claims that pragmatic properties of historical, cultural and economic relevance are incorporated in concepts, irrespectively of whether individual users are aware of this or not. It attempts to track changes of particular concepts (such as “socialism”) over time to determine how their pragmatic relevance changes (it might mostly express generic hopes at some moment and mostly specific fears at some other). Thus, concepts will be categorized as belonging to a particular concept type at a particular moment in time.

Conceptual historians typically read and interpret large masses of texts which provide a variety of information types (e.g. word frequencies, what words appear in the context, how these words function pragmatically (individually as well as in sentences etc.)) which help to determine the concept type. Because we want to do the same using Distant Reading techniques (Moretti, 2013), these information types need to be translated into observable data characteristics for which individual operators in the query language are defined. Finding an adequate number of helpful information types, structuring them and converting them into computable and combinable items is the main challenge of our project.

Since there is no accepted formal specifications of information types, we describe an interpretation of Koselleck’s information types in order to map them on to data characteristics. Data characteristics are quantitative feature either directly present in our data (e.g., the usage frequency of the word “socialism” in 1848), or a derived piece of information (e.g. the difference between the usage frequency of words “socialism” and “communism” from 1848 to 1989). We describe which data characteristics are needed to simulate Koselleck’s information needs and explain our realization of all data characteristics and their implementation as operators.

The relationship between concept types, information types, data characteristics and operators to hypothesize concept types

One of Koselleck’s implicit assumptions is that each concept type has specific characteristics. In our terminology: any concept type can be described using a specific combination of information types. For example, Koselleck may plausibly be read as claiming that words that form a parallel concept (concept type) would have “similar” word frequencies and have a significant number of identical surrounding words (information types). By contrast, counter concepts would also have similar word frequencies yet their surrounding words would behave differently. For instance, if “enlightenment” and “reason” are parallel concepts for a particular period, their relative word frequencies should be similar, and if “emancipation” occurs near “enlightenment”, it should occur near “reason” too, and both concepts should be endorsed rather than criticized (in some sense). By contrast, if “East” and “West” are counter concepts, their word contexts should contain different words, and there should be some sort of contrast in attitude between them.

If every concept type has its own specific linguistic and pragmatic properties and hence should be representable by a specific combination of information types, it should be possible to develop a system that finds these information types in large corpora that are not amenable to conventional close reading. To this end, we need a formal definition of any information type which is observable and quantifiable.

We present a selection of some of the data characteristics with the information type they are intended to represent:

Individual Context: This requires two data characteristics: a set of surrounding words for a target word, i.e., the linguistic context, and the sentiment for this context, by summing up the sentiment values of the words of the context. Our surroundingwords operator and sentiment operator implement this. Topic Grouping: Using topic modeling, groups of words may be classified as belonging to a particular topic (e.g. geography or politics). Sentence Structure: This again requires two data characteristics: the function of a word, i.e., differentiate between parts of speech, and completing phrases, i.e., search for missing words in a phrase. The first data characteristic is implemented by our operator pfilter. We implement the second one as a pattern-matching operator which we call textsearch. Frequency Data: Neologisms, which might be evidence for radical changes, would display abrupt increases in word-usage frequency over time. To find this and similar characteristics, we propose an operator time series-based selection that compares the time-series values with a constant. To allow for a temporal restriction, we also provide a subsequence operator that limits the selection to an arbitrary time interval. The combination of both operators facilitates the search for neologisms. Sentiment Analysis: Using well-proven resources such as LIWC (Wolf et al., 2008) or customized dictionaries, our sentiment operator represents the emotions associated with a concept, relying on the words in its context.

Results

Using CHQL, we have tested the hypotheses that (1) “East” and “West” have acquired a political context after 1945, whereas “North” and “South” haven’t, and that (2) the former have turned into counter concepts in the political sphere, their contexts expressing diverging attitudes, whereas the latter have remained parallel concepts in the geographical sphere. The operator trees 1 and 2 shown in Figures 2 and 4 illustrate how CHQL allows combining the operators mentioned to perform a single search, yielding the results shown in Figures 3 and 5.

Formalisation of hypothesis 1 in CHQL

The result of Query 1 on the Google Books Ngram Corpus

Formalisation of hypothesis 2 in CHQL

The result of Query 2 on the Google Books Ngram Corpus

Bibliography Abiteboul, S., Hull, R. and Vianu, V. (1994). Foundations of Databases: The Logical Level . Pearson. Brunner, O., Conze, W. and Koselleck, R. (eds). (2004). Geschichtliche Grundbegriffe (Volumes 1-8) . Klett-Cotta Verlag. Englhardt, A., Willkomm, J., Schäler, M. and Böhm, K. (2019). Improving Semantic Change Analysis by Combining Word Embeddings and Word Frequencies. International Journal on Digital Libraries . Hai-Jew, S. (ed). (2017). Data Analytics in Digital Humanities . Springer-Verlag GmbH. Hamilton, W., Leskovec, J. and Jurafsky, D. (2016a). Cultural Shift or Linguistic Drift? Comparing Two Computational Measures of Semantic Change. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’16) . pp. 2116–2121. Hamilton, W., Leskovec, J. and Jurafsky, D. (2016b). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. Jakubíček, M., Kilgarriff, A., McCarthy, D. and Rychlý, P. (2010). Fast syntactic searching in very large corpora for many languages. Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation (PACLIC 24) . pp. 741–747. Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W. and Petrov, S. (2012). Syntactic annotations for the Google Books Ngram Corpus. Proceedings of the ACL 2012 System Demonstrations (ACL ’12) . Association for Computational Linguistics, pp. 169–174. Maier, D. (1983). Theory of Relational Databases . Computer Science Press. Moretti, F. (2013). Distant Reading . Verso Books. O’Connor, M. and Das, A. (2009). SQWRL: a query language for OWL. Proceedings of the 6th International Conference on OWL: Experiences and Directions (OWLED ’09) . pp. 208–215. Olsen, N. (2012). History in the Plural: An Introduction to the Work of Reinhart Koselleck . Berghahn Books Inc. Prabhakaran, V., Hamilton, W., McFarland, D. and Jurafsky, D. (2016). Predicting the Rise and Fall of Scientific Topics from Trends in their Rhetorical Framing. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL ’16) . Association for Computational Linguistics, pp. 1170–1180 doi:10.18653/v1/p16-1111. Ritter, J. and Gründer, K. (eds). (1971). Historisches Wörterbuch Der Philosophie (13 Volume Set) . Schwabe. Snodgrass, R. (1987). The temporal query language TQuel. ACM Transactions on Database Systems , 12 (2): 247–298 doi:10.1145/22952.22956. Snodgrass, R. (ed). (1995). The TSQL2 Temporal Query Language . . Vol. 330. (The Springer International Series in Engineering and Computer Science). Springer US. The Library of Congress (2013). The Contextual Query Language . http://www.loc.gov/standards/sru/cql/. Warwick, C., Terras, M. and Nyhan, J. (eds). (2012). Digital Humanities in Practice . Facet Publishing. Willkomm, J., Schmidt-Petri, C., Schäler, M., Schefczyk, M. and Böhm, K. (2018). A Query Algebra for Temporal Text Corpora. Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries (JCDL ’18) . ACM Press, pp. 183–192 doi:10.1145/3197026.3197044. Wolf, M., Horn, A., Mehl, M., Haug, S., Pennebaker, J. and Kordy, H. (2008). Computergestützte quantitative Textanalyse. Diagnostica : 85–98 doi:10.1026/0012-1924.54.2.85. Zeldes, A., Lüdeling, A., Ritz, J. and Chiarcos, C. (2009). ANNIS: A Search Tool for Multi-Layer Annotated Corpora . Humboldt-Universität zu Berlin, Philosophische Fakultät II doi:10.18452/13437.