In the digital humanities, topic models are a widely applied text mining method (Meeks and Weingart, 2012). While their use for mining literary texts is not entirely straightforward (Schmidt, 2012), there is ample evidence for their use on factual text (e.g. Au Yeung and Jatowt, 2011; Thompson et al., 2016). We present an approach for exploring the research fields of selected authors in a corpus of late modern scientific English by means of topic modelling, looking at the topics assigned to an author’s texts over the author’s lifetime. The application areas we target are the history of science, where we may be interested in the evolution of scientific disciplines over time (Thompson et al., 2016; Fankhauser et al., 2016), and diachronic linguistics, where we may be interested in the formation of languages for specific purposes (LSP) or specific scientific “styles” (cf. Bazerman, 1988; Degaetano-Ortlieb and Teich, 2016).
We use the Royal Society Corpus (RSC, Kermes et al., 2016), which is based on the first two centuries (1665–1869) of the Philosophical Transactions and the Proceedings of the Royal Society of London. The corpus contains 9,779 texts (32 million tokens) and is available at
The topic modelling approach we take as a basis is Latent Dirichlet Allocation (LDA, Blei et al., 2003). LDA assumes that corpora contain a number of recurring topics, and it treats texts as bags of words. Topics, which can be regarded as groups of semantically related words, are represented as probability distributions over words, and each text is treated as a mixture of topics. Typically, topics are displayed as lists of their most probable words, and labels are assigned manually. We also considered author-topic models (Rosen-Zvi et al., 2010), but their author-topic matrix implies that an author’s topics are fixed over time, which makes them unsuitable for tracking changes across a career.
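To make the representation concrete, the following minimal sketch shows topics as probability distributions over words and a text as a mixture of topics. The vocabulary, the topic names, and all probabilities are invented for illustration and are not taken from our model.

```python
# Toy illustration of the LDA representation. All words, topic names and
# numbers below are hypothetical and serve only to show the data layout.

vocab = ["acid", "oxygen", "curve", "equation", "nerve", "muscle"]

# Each topic is a probability distribution over the vocabulary (sums to 1).
topics = {
    "topic_0": [0.45, 0.40, 0.02, 0.03, 0.05, 0.05],
    "topic_1": [0.02, 0.03, 0.40, 0.50, 0.03, 0.02],
    "topic_2": [0.05, 0.05, 0.02, 0.03, 0.40, 0.45],
}


def top_words(distribution, vocabulary, n=2):
    """Return the n most probable words of one topic, most probable first."""
    ranked = sorted(
        zip(vocabulary, distribution), key=lambda pair: pair[1], reverse=True
    )
    return [word for word, _ in ranked[:n]]


def word_probabilities(topic_mixture, topic_word):
    """p(word | text) = sum over topics of p(topic | text) * p(word | topic)."""
    size = len(next(iter(topic_word.values())))
    probs = [0.0] * size
    for name, weight in topic_mixture.items():
        for i, p in enumerate(topic_word[name]):
            probs[i] += weight * p
    return probs


# A text is a mixture of topics (hypothetical proportions, summing to 1).
mixture = {"topic_0": 0.7, "topic_1": 0.1, "topic_2": 0.2}

if __name__ == "__main__":
    for name, distribution in topics.items():
        print(name, top_words(distribution, vocab))
    print(word_probabilities(mixture, topics))
```

The word lists returned by `top_words` correspond to the lists of most probable words that are usually shown to a human annotator for labelling.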
As disciplines were not part of the original metadata of the RSC, we applied topic modelling to approximate disciplines. Using MALLET (McCallum, 2002), we built a model with 24 topics, which are shown along with their characteristic words in Figure 2.
Following the approach of Fankhauser et al. (2016), we clustered the topics using Jensen–Shannon divergence. Figure 3 shows the resulting topic hierarchy. Based on this clustering, we identified broader research areas, which we marked on the branches of the dendrogram.
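As an illustration of the distance measure, the sketch below computes the Jensen–Shannon divergence between toy topic distributions and finds the closest pair, which is the first merge an agglomerative clustering would perform. The topic labels and values are hypothetical, and the choice of which distributions to compare and which linkage criterion to use is an assumption left open here.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def closest_pair(distributions):
    """Return the two names whose distributions have the smallest divergence."""
    return min(
        combinations(distributions, 2),
        key=lambda pair: js_divergence(
            distributions[pair[0]], distributions[pair[1]]
        ),
    )


# Hypothetical topic distributions over a four-item vocabulary.
toy_topics = {
    "chemistry_a": [0.50, 0.40, 0.05, 0.05],
    "chemistry_b": [0.40, 0.50, 0.05, 0.05],
    "mathematics": [0.05, 0.05, 0.50, 0.40],
}

if __name__ == "__main__":
    for a, b in combinations(toy_topics, 2):
        print(a, b, round(js_divergence(toy_topics[a], toy_topics[b]), 4))
    print("first merge:", closest_pair(toy_topics))
```

Repeating the merge step on the remaining clusters yields a hierarchy of the kind shown as a dendrogram.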
Using these broader categories, we explore whether individual authors stayed in the same area or shifted their focus during their time of scientific production. For this purpose, we selected the most prolific authors (29–198 articles) in the RSC and tracked their topics over time (see Figures 4 and 5). We excluded names if we could not identify the author in the Virtual International Authority File or if publication years did not match the author’s lifetime.
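One simple way to track an author’s topics over time is to average the topic proportions of the author’s texts per time bin. The sketch below does this under stated assumptions: the record layout, the author names, the ten-year bins, and the use of the mean are all hypothetical choices made for illustration, and the data are invented.

```python
from collections import defaultdict


def topic_profile(documents, author, bin_size=10):
    """Mean topic proportions per time bin for the texts of one author.

    Returns {bin_start_year: {topic: mean_proportion}}, ordered by year.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for doc in documents:
        if doc["author"] != author:
            continue
        start = doc["year"] // bin_size * bin_size
        counts[start] += 1
        for topic, weight in doc["topics"].items():
            sums[start][topic] += weight
    return {
        start: {topic: total / counts[start] for topic, total in sums[start].items()}
        for start in sorted(sums)
    }


# Invented records: one entry per text with its topic proportions.
documents = [
    {"author": "Author A", "year": 1801, "topics": {"chemistry": 0.8, "physics": 0.2}},
    {"author": "Author A", "year": 1805, "topics": {"chemistry": 0.6, "physics": 0.4}},
    {"author": "Author A", "year": 1812, "topics": {"chemistry": 0.5, "physics": 0.5}},
    {"author": "Author B", "year": 1803, "topics": {"mathematics": 1.0}},
]

if __name__ == "__main__":
    for start, profile in topic_profile(documents, "Author A").items():
        print(start, profile)
```

Counting the texts per bin in the same loop would additionally give a rough picture of an author’s productivity over time.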
Figure 4 shows the topics used by twelve authors during their careers. We can see two groups of authors: authors like Arthur Cayley dedicated their lives to a single research area, whereas authors like Humphry Davy worked on two topics or in an interdisciplinary area. Figure 5 shows the development of the same authors over time. Overall, the authors’ interests did not change dramatically over their professional lives. However, one can identify a peak of productivity for most authors.
We have proposed topic modelling as a method for exploring the development of the scientific orientation of individual authors over time. Taking topics as an approximation of disciplines, our approach can be used to explore the contribution of a particular author to a given discipline over time or to find authors with potentially interesting production profiles (e.g. authors shifting topics). In future work, we will improve our models (e.g. by avoiding the potential confusion of namesakes) using better author metadata, which we will obtain from the Royal Society.
We acknowledge the support of DFG (Deutsche Forschungsgemeinschaft) through the Cluster of Excellence Multimodal Computing and Interaction (MMCI).