Introducing Polo: Exploring Topic Models as Database and Hypertext

Rafael Alvarado, University of Virginia, United States of America, ontoligent@gmail.com
2014-12-19T13:50:00Z


Paper type: Poster (was submitted as Short Paper)

Keywords: topic models; interpretation; corpora and corpus activities; databases & dbms; anthropology; data mining / text mining; English; computer science; artificial intelligence and machine learning

Since the invention of Latent Dirichlet Allocation (Blei et al. 2002) and early demonstrations of its utility for identifying lexical clusters in collections of historical and literary texts (Newman and Block 2006, Blevins 2010), topic models have become a mainstay of the digital humanities. However, the use of topic models within the field remains narrowly conceived, restricted largely, with some exceptions, to the discovery of topics and topic trends within corpora, even though the method has been extended significantly since it was first introduced. One reason for this conservatism may be that, like many methods drawn from data science, both the process and the output of topic model algorithms remain interpretively opaque to humanists (and, arguably, to computer scientists as well). Aside from the complexity of the math involved, a contributing factor to this opacity has been the limited way in which the results of topic models are presented to the user. On the one hand, the data provided by standard topic modeling tools (whether in Java, Python, or R) are often trapped in data files or shielded by objects that cannot be queried directly or visualized freely without ad hoc programming or spreadsheet software. On the other hand, the outputs typically provided by these tools, such as top words per topic (often visualized as word clouds), show a highly restricted, decontextualized, and potentially distorted picture of the model (Schmidt 2013). Recently, various tools have emerged to fill this gap, such as TOME (Klein et al. 2015), which is designed to allow scholars to explore topic models more fully. In this talk I will present Polo, a topic model browser developed at the Data Science Institute at the University of Virginia and designed to present topic models to users in a direct, transparent, and complete manner, so that the representational quality of models may be explored, questioned, and adjusted interactively.
Built on top of MALLET, Gensim, and NLTK, Polo is a Python package that provides tools both to create topic models and to inspect them by combining the source corpus with all of the data produced by the core software into a single, normalized relational database (in SQLite). This database in turn forms the foundation of an interactive web application that effectively converts the output model, its associated data, and the source corpus into a single hypertext relating words, topics, and documents. A key design feature of Polo is that it employs the statistical properties of the model -- such as topic entropy in documents or mutual information among topics -- not simply as readouts on a dashboard but as navigational devices that allow the user to move from a reduced-dimension, high-level perspective of a corpus to its source documents, and to move laterally through the network of topics and documents that compose the model. Using examples from both newspaper and journal collections, I will demonstrate how Polo enables scholars both to investigate implied cultural networks in these corpora and to explore the various ways in which topics may be said to convey meaning.
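
To make the idea of a model stored as a normalized database more concrete, the following is a minimal sketch in Python, not Polo's actual schema or code. The table names (`doc`, `topic`, `doctopic`), the column names, the helper functions, and the toy weights are all assumptions made for illustration. The sketch stores document-topic weights in an in-memory SQLite database and ranks documents by the Shannon entropy of their topic distributions, which is one way a statistic of the model could serve as a navigational ordering.

```python
import math
import sqlite3

# Hypothetical schema: these table and column names are illustrative
# assumptions, not the schema Polo itself uses.
SCHEMA = """
CREATE TABLE doc      (doc_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE topic    (topic_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE doctopic (
    doc_id   INTEGER REFERENCES doc(doc_id),
    topic_id INTEGER REFERENCES topic(topic_id),
    weight   REAL,
    PRIMARY KEY (doc_id, topic_id)
);
"""


def topic_entropy(weights):
    """Shannon entropy (in bits) of one document's topic weights."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


def build_demo_db():
    """Create an in-memory database with two toy documents and two topics."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO doc VALUES (?, ?)",
        [(1, "Focused article"), (2, "Mixed article")],
    )
    conn.executemany(
        "INSERT INTO topic VALUES (?, ?)",
        [(1, "shipping"), (2, "politics")],
    )
    # Toy weights: document 1 is all one topic, document 2 is an even mix.
    conn.executemany(
        "INSERT INTO doctopic VALUES (?, ?, ?)",
        [(1, 1, 1.0), (1, 2, 0.0), (2, 1, 0.5), (2, 2, 0.5)],
    )
    return conn


def docs_by_entropy(conn):
    """Return (title, entropy) pairs, lowest-entropy documents first."""
    rows = conn.execute(
        "SELECT d.doc_id, d.title, dt.weight "
        "FROM doc d JOIN doctopic dt USING (doc_id) "
        "ORDER BY d.doc_id, dt.topic_id"
    ).fetchall()
    titles, weights = {}, {}
    for doc_id, title, weight in rows:
        titles[doc_id] = title
        weights.setdefault(doc_id, []).append(weight)
    ranked = [(titles[d], topic_entropy(ws)) for d, ws in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1])


if __name__ == "__main__":
    for title, entropy in docs_by_entropy(build_demo_db()):
        print(f"{title}: {entropy:.3f} bits")
```

In this toy example a document dominated by a single topic has low entropy and sorts first, while an evenly mixed document sorts last. A browser built on such a table could link each row to the document's text and to its topics, so that the ranking becomes a path into the corpus as well as a summary of it.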

Bibliography

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2002. “Latent Dirichlet Allocation.” In Advances in Neural Information Processing Systems 14, edited by T. G. Dietterich, S. Becker, and Z. Ghahramani, 601–608. MIT Press.

Blevins, Cameron. 2010. “Topic Modeling Martha Ballard’s Diary.” Cameron Blevins (blog). April 1, 2010. http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/.

Klein, Lauren F., Jacob Eisenstein, and Iris Sun. 2015. “Exploratory Thematic Analysis for Digitized Archival Collections.” Digital Scholarship in the Humanities 30 (suppl_1): i130–41.

Newman, David J., and Sharon Block. 2006. “Probabilistic Topic Decomposition of an Eighteenth-Century American Newspaper.” Journal of the American Society for Information Science and Technology 57 (6): 753–767.

Schmidt, Benjamin M. 2013. “Words Alone: Dismantling Topic Models in the Humanities.” Journal of Digital Humanities. April 5, 2013.