Science gradually developed into an established sociocultural domain starting from the mid-17th century onwards. In this process it became increasingly specialized and diversified. Here, we investigate a particular aspect of specialization on the basis of probabilistic topic models. As a corpus we use the Royal Society Corpus (Khamis et al. 2015), which covers the period from 1665 to 1869 and contains 9015 documents. Of these 205 years, 159 actually contain documents (documents per year: mean = 56.7, median = 36, sd = 61.6, min = 12, max = 444).
We follow the overall approach of applying topic models to diachronic corpora (Blei and Lafferty 2006, Hall et al. 2008, Griffiths and Steyvers 2004, McFarland et al. 2013, Newman and Block 2006, Yang et al. 2011) to map documents to topics. Probabilistic topic models (Steyvers and Griffiths 2007) have become a popular means to summarize and analyze the content of text corpora. The principal idea is to model the generation of documents with a randomized two-stage process: for every word $w$ in a document $d$, select a topic $z$ from the document-topic distribution $P(z|d)$, and then select the word from the topic-word distribution $P(w|z)$. The document-word distribution thus factorizes as $P(w|d) = \sum_z P(w|z) P(z|d)$.
This factorization effectively reduces the dimensionality of the model for documents, improving their interpretability: whereas $P(w|d)$ requires one dimension per distinct word, $P(z|d)$ requires only one dimension per topic. The topics $z$ are not observed directly but are treated as latent variables: a variety of approaches exist to estimate the document-topic and topic-word distributions from the observable document-word distributions. We use Gibbs sampling as implemented in Mallet (McCallum 2002).
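The two-stage process can be illustrated with a minimal Python sketch. It only shows forward generation from known distributions and the resulting mixture probability of a word; it is not the Gibbs sampler used in Mallet. The topic labels are borrowed from the paper, but the toy probabilities and the function names `generate_document` and `word_probability` are invented for illustration.

```python
import random


def generate_document(doc_topic, topic_word, length, rng):
    """Sample `length` words with the two-stage process: topic first, then word."""
    topics = list(doc_topic)
    words = []
    for _ in range(length):
        # Stage 1: draw a topic z from the document-topic distribution P(z|d).
        z = rng.choices(topics, weights=[doc_topic[t] for t in topics])[0]
        # Stage 2: draw a word w from the topic-word distribution P(w|z).
        vocab = list(topic_word[z])
        w = rng.choices(vocab, weights=[topic_word[z][v] for v in vocab])[0]
        words.append(w)
    return words


def word_probability(word, doc_topic, topic_word):
    """Mixture probability P(w|d) = sum over z of P(w|z) * P(z|d)."""
    return sum(p_z * topic_word[z].get(word, 0.0) for z, p_z in doc_topic.items())


# Toy distributions (invented values, not estimated from the corpus).
topic_word = {
    "Chemistry": {"acid": 0.6, "water": 0.3, "star": 0.1},
    "Astronomy": {"acid": 0.05, "water": 0.15, "star": 0.8},
}
doc_topic = {"Chemistry": 0.75, "Astronomy": 0.25}

rng = random.Random(0)
sample = generate_document(doc_topic, topic_word, 10, rng)
p_acid = word_probability("acid", doc_topic, topic_word)
print(sample)
print(p_acid)
```

With two topics, the document is described by two numbers instead of one probability per vocabulary word, which is the dimensionality reduction described above.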
For the preliminary analysis in this paper, we process documents as is, without segmenting them further into pages. We exclude stop words but perform no lemmatization or normalization, in order to stay reasonably close to the original source. We experimented with between 20 and 30 topics and report results for 24 topics here. Cursory analysis of multiple runs with different seeds (Steyvers and Griffiths 2007) shows that the resulting topics are rather stable.
Table 1 displays the top words for the topics with manually assigned labels and their overall percentage of occurrence. We can roughly distinguish four groups of topics: three non-thematic groups and one thematic group. The first group comprises topics arising from documents in Latin and French, some of which are also translated into English. The second group, Formulae and Tables, relates to highly formalized modes of information presentation. The third group of topics is also clearly non-thematic but relates to general scientific processes: Observation and Experiment both contain rather general verbs and adjectives in addition to nouns. Events contains words describing remarkable events. Headmatter includes formulaic expressions typically occurring at the beginning and end of documents that are letters. All topics in this group are relatively frequent. Finally, the topics in the fourth group (Geography through Chemistry), consisting mainly of nouns, indeed have a fairly clear thematic interpretation.
Table 1: Top words and percentages for topics
To investigate topical trends in the corpus, we follow the approach of Hall et al. (2008) and average the document-topic distributions for each year $y$: $P(z|y) = \frac{1}{n} \sum_{d \in y} P(z|d)$, with $n$ the number of documents in year $y$.
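The yearly averaging can be sketched as follows. This is a minimal illustration assuming each document is given as a pair of publication year and document-topic distribution; the function name `year_topic_distributions` and the toy values are hypothetical and not taken from the corpus.

```python
from collections import defaultdict


def year_topic_distributions(docs):
    """Average document-topic distributions per year.

    `docs` is an iterable of (year, distribution) pairs, where each
    distribution is a list of topic probabilities of equal length.
    Years without documents simply do not appear in the result.
    """
    sums = {}
    counts = defaultdict(int)
    for year, dist in docs:
        if year not in sums:
            sums[year] = [0.0] * len(dist)
        for k, p in enumerate(dist):
            sums[year][k] += p
        counts[year] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


# Toy input with two topics (invented values).
docs = [
    (1665, [0.8, 0.2]),
    (1665, [0.4, 0.6]),
    (1700, [0.1, 0.9]),
]
by_year = year_topic_distributions(docs)
print(by_year)
```

Because every document-topic distribution sums to one, each yearly average is again a probability distribution over topics.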
Figure 1 shows a selection of five topics with the most pronounced change over time. Interestingly, some of the major changes occur for non-thematic topics: the topic Observation declines sharply from over 30% to less than 1%. The topics Experiment and Formulae, on the other hand, increase starting around 1750. This indicates a substantial paradigm shift over time. Indeed, as Gleick (2010) vividly describes, the early stages of the Royal Society were largely devoted to observing and reporting about natural phenomena. The non-thematic topic Latin reaches its peak in the early 18th century, and the thematic topics Cells and Chemistry show a clear increase with the beginning of the 19th century.
To gain a better understanding of the correlation between topics, we cluster them hierarchically on the basis of the Jensen-Shannon divergence between the topic-document distributions: topics that typically co-occur in documents have similar topic-document distributions and thus will be placed close together in the tree.
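A minimal pure-Python sketch of such a clustering is given below. Several details are assumptions, since the text does not specify them: the topic-document distributions are obtained by normalizing each topic's column of the document-topic matrix, the divergence uses base-2 logarithms, and clusters are merged with average linkage. The function names and the toy matrix are hypothetical.

```python
import math


def normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: entropy of the mixture minus mean entropy."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def cluster(names, dists, n_groups):
    """Agglomerative clustering with average linkage (an assumed choice)."""
    clusters = [[i] for i in range(len(names))]

    def linkage(a, b):
        total = sum(js_divergence(dists[i], dists[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_groups:
        # Find the closest pair of clusters and merge them.
        pairs = [
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(names[i] for i in c) for c in clusters]


# Toy document-topic matrix: rows are documents, columns are topics (invented).
topics = ["Observation", "Events", "Formulae", "Tables"]
doc_topic = [
    [0.50, 0.40, 0.05, 0.05],
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.05, 0.05, 0.40, 0.50],
]
# Topic-document distribution of topic k: its column, normalized over documents.
topic_doc = [normalize([row[k] for row in doc_topic]) for k in range(len(topics))]
groups = cluster(topics, topic_doc, 2)
print(groups)
```

In this toy example the two topics that share documents end up in the same group, which mirrors the intuition that co-occurring topics are placed close together in the tree.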
The resulting tree in Figure 2 indeed identifies meaningful subgroups. Cutting the tree into six groups (Nature, Latin, Medicine, Astronomy, Engineering, Matter) allows us to investigate the overall topic distribution over time (Figure 3, with Latin left out): the topic group Nature, comprising reports of all kinds of natural phenomena (Gleick 2010), clearly decreases over time, which is partially to be attributed to the strong decrease of the topic Observation in this group. The topic groups Medicine and Astronomy increase over time, whereas the topic groups Engineering and Matter also generally increase but with some intermediate peaks. Similar to the overall trends at the level of individual topics (Figure 1), the biggest overall change occurs in the 2nd half of the 18th century.
Looking at the individual trends together, Figure 3 clearly indicates topical diversification: until around 1770, the dominance of the topic group Nature leads to a highly skewed distribution of topic groups, whereas after 1770 topic groups are distributed much more evenly. The amount of skew can be characterized by the Shannon entropy $H(y) = -\sum_z P(z|y) \log P(z|y)$ of the year-topic distributions $P(z|y)$. As shown in Figure 4 (left), the entropy (ent) increases fairly consistently during the 18th century and levels out during the 19th century, reflecting a general increase of topical diversity over time.
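How entropy reflects skew can be seen in a small numerical sketch. The two distributions over five topic groups below are invented for illustration, and the base-2 logarithm is an assumption; the text does not state the base used.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


# Hypothetical year-topic distributions over five topic groups.
skewed = [0.80, 0.05, 0.05, 0.05, 0.05]  # one dominant group
even = [0.20, 0.20, 0.20, 0.20, 0.20]    # groups evenly distributed

h_skewed = entropy(skewed)
h_even = entropy(even)
print(h_skewed, h_even)
```

The even distribution attains the maximum possible entropy for five groups, log2(5), while the skewed distribution stays well below it.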
It is interesting to compare this with the mean entropy of the individual document-topic distributions (ment): $\bar{H}(y) = \frac{1}{n} \sum_{d \in y} H(d)$, with $H(d) = -\sum_z P(z|d) \log P(z|d)$ and $n$ the number of documents in year $y$.
The difference between the entropy of year-topic distributions and the mean entropy of individual document-topic distributions, $H(y) - \bar{H}(y)$, is the Jensen-Shannon divergence, which is usually applied to two distributions, generalized to the $n$ documents of year $y$. The two opposing trends of these quantities, rising year-topic entropy and falling mean document entropy, lead to a constantly increasing Jensen-Shannon divergence, with a particularly sharp increase between 1750 and 1800.
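The relation between the three quantities can be sketched for a single year as follows. The sketch assumes uniform document weights and base-2 logarithms; the function name `generalized_jsd` and the two toy years are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def generalized_jsd(doc_dists):
    """Return (ent, ment, jsd) for the documents of one year.

    ent  : entropy of the averaged (year-topic) distribution
    ment : mean entropy of the individual document-topic distributions
    jsd  : ent - ment, the generalized Jensen-Shannon divergence
    """
    n = len(doc_dists)
    k = len(doc_dists[0])
    year_dist = [sum(d[j] for d in doc_dists) / n for j in range(k)]
    ent = entropy(year_dist)
    ment = sum(entropy(d) for d in doc_dists) / n
    return ent, ment, ent - ment


# Two toy years with the same year-topic distribution [0.5, 0.5].
mixed_year = [[0.5, 0.5], [0.5, 0.5]]        # every document mixes both topics
specialized_year = [[1.0, 0.0], [0.0, 1.0]]  # every document covers one topic

mixed = generalized_jsd(mixed_year)
specialized = generalized_jsd(specialized_year)
print(mixed)
print(specialized)
```

Both toy years have the same year-level entropy, but only the year with specialized documents has a positive divergence. This is why the divergence, rather than the year-level entropy alone, captures the separation of documents into distinct topics.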
Figure 4 (right) depicts similar trends based on the 24 individual topic distributions. At this level, the year-topic entropy (ent) shows less of a clear trend, but mean entropy (ment) also clearly decreases, and consequently the Jensen-Shannon divergence clearly increases. Thus, at both levels of abstraction we can observe a clear diversification of the topics assigned to the individual documents. This strongly indicates a growing separation of individual scientific disciplines over time.
As an alternative perspective on topical entropy, Table 2 gives examples of authors with more than 20 papers. The first three authors have the lowest entropy. The dominating topics for Cayley and Owen clearly characterize their main theme of work. Conversely, Rev. John Swinton's top topic Headmatter (62%) does not really reflect the overall theme of his publications (Orientalism), but rather their style as letters to members of the Royal Society, the dominant form of publication in this period. The second three authors have the highest entropy: their three top topics together account for less than 50% of their overall topic distribution. However, these topics do characterize the main line of work of the authors in question fairly well.
Table 2: Authors with minimum entropy (top) and maximum entropy (bottom)
In this paper we have analyzed the progression of topics in a corpus of the Royal Society of London. Our main result is the observation that the overall mixture of topics becomes more diverse over time, while the topics of individual documents become more specialized. These two opposing trends lead to a topical fragmentation of scientific discourse, which can be quantified by means of the generalized Jensen-Shannon divergence between the topic distributions of individual documents per time period. We are currently working on consolidating our analysis, experimenting with documents segmented into pages, focusing the analysis on different text types, and more carefully evaluating the resulting topic models (McFarland et al. 2013).
Of course, topic models provide only one, rather broad perspective on the diversification of domain-specific language. We also plan to apply our approach to other levels of linguistic analysis, such as terminology or grammar.