The digitization of large time-labeled bibliographies has resulted in corpora such as the Google Ngram data set (Lin et al., 2012). Such corpora reflect with high accuracy how individual words are used over time. They are expected to reveal novel insights into the evolution of language and society, provided adequate analysis systems are available. In this context, a comprehensive query algebra that allows domain experts to formalize complex hypotheses would be a major contribution to unlocking this potential.
The case of conceptual history serves as our example from the humanities. In conceptual history, researchers examine the evolution of concepts represented by words such as “peace” or “freedom”. In exploring the history of a concept, scholars commonly draw on, but are not restricted to, word-usage frequencies, word contexts, sentiment analysis, and the ways in which words refer to, relate to, and contrast with each other; they also look for word pairs or word families whose usage is correlated (Brunner et al., 2004; Ritter and Gründer, 1971). Our running example is how the words “East” and “West” changed from mere cardinal directions to politically charged concepts after 1945.
In this paper, we present a query algebra for empirical analyses of temporal text corpora, the Conceptual History Query Language (CHQL). A temporal text corpus in our sense is a set of words and word chains, i.e., ngrams, together with their usage frequency at various points of time. Our query language is meant to be useful for domain experts, i.e., to be descriptive and complete (able to express all actual and potential hypotheses of conceptual history), and to offer optimization potential that allows fast query processing on large data sets. We focus on an algebra inspired by the German tradition of Begriffsgeschichte (conceptual history), as exemplified by the work of Reinhart Koselleck (Olsen, 2012).
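The notion of a temporal text corpus described above can be illustrated with a minimal Python sketch. It is not part of CHQL: the class name TemporalCorpus, its methods, and the choice of a yearly granularity are assumptions made purely for illustration, and the frequency values are invented.

```python
from collections import defaultdict


class TemporalCorpus:
    """Hypothetical in-memory corpus: ngram -> {year: usage frequency}."""

    def __init__(self):
        # Each ngram maps to its own year -> frequency table.
        self._series = defaultdict(dict)

    def add(self, ngram, year, frequency):
        """Record the usage frequency of an ngram for one year."""
        self._series[ngram][year] = frequency

    def frequency(self, ngram, year):
        """Return the recorded frequency, or 0.0 if nothing was recorded."""
        return self._series.get(ngram, {}).get(year, 0.0)

    def time_series(self, ngram, start, end):
        """Return the frequencies for every year from start to end inclusive."""
        return [self.frequency(ngram, year) for year in range(start, end + 1)]


# Invented toy values, not taken from any real corpus.
corpus = TemporalCorpus()
corpus.add("East", 1945, 0.5)
corpus.add("East", 1946, 0.75)
corpus.add("East Germany", 1946, 0.25)

east_series = corpus.time_series("East", 1944, 1946)
```

A dictionary keyed by ngram keeps lookups of a single time series cheap; a system intended for corpora of the size discussed here would need a storage layer with very different properties.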
Existing query algebras, like the one for the Structured Query Language (SQL), do not feature specific support for analyses of the kind we envisage. Other approaches from the literature, e.g., the Contextual Query Language (The Library of Congress, 2013), the Corpus Query Language (Jakubíček et al., 2010), or the ANNIS Query Language (Zeldes et al., 2009), share this limitation. The common relational algebra (Maier, 1983; Abiteboul et al., 1994) does not contain sufficiently specific operators, e.g., temporal or linguistic operators. Extensions exist that add temporal operators (Snodgrass, 1987; Snodgrass, 1995), but not linguistic operators. To query relations between words, there are special-purpose query languages. For example, SQWRL is a language to query an ontology (O'Connor and Das, 2009). However, word relations queried from an ontology do not include all required linguistic relationships. Further, ontologies do not provide temporal information, and SQWRL does not contain any temporal operator. What all of these algebras have in common is that none covers both the linguistic and the temporal operators required for research on conceptual history.
Related work in the digital humanities mainly consists of data processing and the analysis of text corpora (Warwick et al., 2012; Hai-Jew, 2017). Some frameworks focus on linguistic and reflective properties as well as their evolution (Hamilton et al., 2016a; Hamilton et al., 2016b; Prabhakaran et al., 2016; Englhardt et al., 2019). These systems cannot output the information required to conduct research on conceptual history in a comprehensive way. In addition, such systems do not provide a sufficiently abstract interface, which is one reason why experts are reluctant to use them (Hai-Jew, 2017).
This section shows in the abstract how the operators of CHQL allow searching for concept types. A formal definition of all of our operators is given in (Willkomm et al., 2018) and will be presented at DH2019.
Conceptual history claims that pragmatic properties of historical, cultural and economic relevance are incorporated in concepts, irrespective of whether individual users are aware of this or not. It attempts to track changes of particular concepts (such as “socialism”) over time to determine how their pragmatic relevance changes (a concept might mostly express generic hopes at one moment and mostly specific fears at another). Thus, concepts will be categorized as belonging to a particular concept type at a particular moment in time.
Conceptual historians typically read and interpret large masses of texts, which provide a variety of information types that help to determine the concept type, e.g., word frequencies, which words appear in the context, and how these words function pragmatically, both individually and in sentences. Because we want to do the same using Distant Reading techniques (Moretti, 2013), these information types need to be translated into observable data characteristics for which individual operators in the query language are defined. Finding an adequate number of helpful information types, structuring them and converting them into computable and combinable items is the main challenge of our project.
Since there are no accepted formal specifications of information types, we describe an interpretation of Koselleck’s information types in order to map them onto data characteristics. A data characteristic is a quantitative feature that is either directly present in our data (e.g., the usage frequency of the word “socialism” in 1848) or derived from it (e.g., the difference between the usage frequencies of the words “socialism” and “communism” from 1848 to 1989). We describe which data characteristics are needed to simulate Koselleck’s information needs and explain our realization of all data characteristics and their implementation as operators.
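The distinction between a directly present and a derived data characteristic can be made concrete with a short Python sketch. The function names usage_frequency and frequency_difference, the dictionary layout, and all numeric values are assumptions introduced for illustration only; they are not the CHQL operators, whose formal definitions are given in (Willkomm et al., 2018).

```python
def usage_frequency(series, word, year):
    """Directly present characteristic: frequency of a word in one year."""
    return series.get(word, {}).get(year, 0.0)


def frequency_difference(series, word_a, word_b, start, end):
    """Derived characteristic: per-year frequency of word_a minus word_b.

    Covers every year from start to end inclusive; years without data
    count as frequency 0.0.
    """
    return {
        year: usage_frequency(series, word_a, year)
        - usage_frequency(series, word_b, year)
        for year in range(start, end + 1)
    }


# Invented toy values, not taken from any real corpus.
series = {
    "socialism": {1848: 0.5, 1849: 0.75},
    "communism": {1848: 0.25, 1849: 0.25},
}

direct = usage_frequency(series, "socialism", 1848)
derived = frequency_difference(series, "socialism", "communism", 1848, 1849)
```

Here the derived characteristic is computed year by year, which is one possible reading of "difference ... from 1848 to 1989"; an aggregate over the whole period would be an equally plausible alternative.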
One of Koselleck’s implicit assumptions is that each concept type has specific characteristics. In our terminology: any concept type can be described using a specific combination of information types. For example, Koselleck may plausibly be read as claiming that words that form a parallel concept (concept type) would have “similar” word frequencies and have a significant number of identical surrounding words (information types). By contrast, counter concepts would also have similar word frequencies, yet their surrounding words would behave differently. For instance, if “enlightenment” and “reason” are parallel concepts for a particular period, their relative word frequencies should be similar, and if “emancipation” occurs near “enlightenment”, it should occur near “reason” too, and both concepts should be endorsed rather than criticized (in some sense). By contrast, if “East” and “West” are counter concepts, their word contexts should contain different words, and there should be some sort of contrast in attitude between them.
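The reading of parallel and counter concepts given above can be sketched in Python as a combination of two measurable information types. This is a deliberately simplified illustration and not the CHQL definition: the similarity measure (one minus the mean relative frequency difference), the use of Jaccard overlap for surrounding words, both thresholds, and all function names are assumptions. The contrast in attitude mentioned in the text is left out entirely, and frequencies are assumed to be non-negative.

```python
def frequency_similarity(freq_a, freq_b):
    """Return 1 minus the mean relative difference over shared years (0..1)."""
    years = sorted(set(freq_a) & set(freq_b))
    if not years:
        return 0.0
    total = 0.0
    for year in years:
        a, b = freq_a[year], freq_b[year]
        largest = max(a, b)
        # Two zero frequencies count as identical.
        total += 0.0 if largest == 0 else abs(a - b) / largest
    return 1.0 - total / len(years)


def context_overlap(context_a, context_b):
    """Jaccard overlap of two sets of surrounding words (0..1)."""
    union = context_a | context_b
    return len(context_a & context_b) / len(union) if union else 0.0


def classify_pair(freq_a, freq_b, context_a, context_b,
                  freq_threshold=0.8, context_threshold=0.5):
    """Label a word pair; both thresholds are hypothetical defaults."""
    if frequency_similarity(freq_a, freq_b) < freq_threshold:
        return "neither"
    if context_overlap(context_a, context_b) >= context_threshold:
        return "parallel candidate"
    return "counter candidate"


# Invented toy data, not taken from any real corpus.
same_freq = {1950: 4.0, 1951: 8.0}
label_parallel = classify_pair(
    same_freq, same_freq,
    {"emancipation", "progress"}, {"emancipation", "progress", "liberty"},
)
label_counter = classify_pair(
    same_freq, same_freq,
    {"bloc", "soviet"}, {"nato", "alliance"},
)
```

The labels are called candidates on purpose: similar frequencies combined with disjoint contexts are at best a necessary signal for a counter concept, since the attitude contrast would still have to be checked.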
If every concept type has its own specific linguistic and pragmatic properties and hence should be representable by a specific combination of information types, it should be possible to develop a system that finds these information types in large corpora that are not amenable to conventional close reading. To this end, we need a formal definition of each information type that is observable and quantifiable.
We present a selection of some of the data characteristics with the information type they are intended to represent:
Using CHQL, we have tested the hypotheses that (1) “East” and “West” acquired a political context after 1945, whereas “North” and “South” did not, and that (2) the former have turned into counter concepts in the political sphere, their contexts expressing diverging attitudes, whereas the latter have remained parallel concepts in the geographical sphere. Operator trees 1 and 2, shown in Figures 2 and 4, illustrate how CHQL allows combining the operators mentioned to perform a single search, yielding the results shown in Figures 3 and 5.