SLaTE: A System for Labeling Topics with Entities Anne Lauscher anne@informatik.uni-mannheim.de University of Mannheim, Germany Federico Nanni federico@informatik.uni-mannheim.de University of Mannheim, Germany Simone Paolo Ponzetto simone@informatik.uni-mannheim.de University of Mannheim, Germany In recent years, the Latent Dirichlet allocation (LDA) topic model (Blei, Ng, and Jordan, 2003) has become one of the most employed text mining techniques (Meeks and Weingart 2012) in the digital humanities (DH). Scholars have often noted its potential for text exploration and distant reading analyses, even when it is well known that its results are difficult to interpret (Chang et al, 2009) and to evaluate (Wallach et al, 2009). At last year's edition of the Digital Humanities conference, we introduced a new corpus exploration method able to produce topics that are easier to interpret and evaluate than standard LDA topic models (Nanni and Ruiz, 2016). We did so by combining two existing techniques, namely Entity linking and Labeled LDA (L-LDA). At its heart, our method first identifies a collection of descriptive labels for the topics of arbitrary documents from a corpus, as provided from the vocabulary of entities found within wide-coverage knowledge resources (e.g., Wikipedia, DBpedia). Then it generates a specific topic for each label. Having a direct relation between topics and labels makes interpretation easier, and using a disambiguated knowledge resource as background knowledge limits label ambiguity. As our topics are described with a limited number of unambiguous labels, they promote in-terpretability, and this may sustain the use of the results as quantitative evidence in humanities research (Lauscher et al, 2016). The contributions of this poster cover the release of: a) a complete implementation of the processing pipeline for our entity-based LDA approach; b) a three-step evaluation platform that enables its extensive quantitative analysis. Entity-based Topic Modeling Pipeline Figure 1 illustrates the computational pipeline of our system; python classes are represented in rectangles. First of all, a set of text files is imported into the system and several preprocessing steps are applied to the textual content. Next, the data is sent to the entity linking system TagMe (Ferragina and Scaiella, 2010), which disambiguates against Wikipedia. As a result of this step, for each document a set of related Wikipedia entities is retrieved. Now, the data is inserted into a MySQL database. [063-1] Figure 1. Architecture of the pipeline Afterwards, the TF-IDF measure is computed over the entities, which we use to rank all the entities for each document in descending order. Then, the top k entities as well as their corresponding documents are exported into a comma-separated values file that is given as input to the L-LDA implementation of the Stanford Topic Modeling Toolbox. Finally, after running L-LDA and applying several post-processing steps, we obtain a document-topic distribution saved in the database in which each topic is described by an unambiguous label linked to Wikipedia. The whole source code is available for public download on Github. Given a working Python, Java, and Scala runtime as well as a running MySQL installation our pipeline is ready directly out-of-the-box. The specific configuration according to the user's needs can be made via a simple text file. Three-Step Evaluation Platform Document Labels In order to assess the quality of the detected entities as labels we developed a specific browser-based evaluation platform, which permits manual annotations. This platform presents a document on the right of the screen and a set of possible labels on the left (See Figure 2). Annotators are asked to pick labels that precisely describe the content of each document. In case the annotator does not select any label, this is also recorded by our evaluation system. Entities and Topic Words In order to establish if the selected entities were the right labels for the topics produced, we developed two additional evaluation steps. Inspired by the topic intrusion task (Chang et al, 2009), we designed a platform that permits to evaluate the relations between labels and topics using two evaluation modes: For one evaluation mode (that we called Label Mode - Figure 3), the annotator is asked to choose, when possible, the correct list of topic-words given a label. For the other, he/she was asked to pick the right label given a list of topic words (aee Figure 4). In both cases, the annotator is shown three options: one of them is the correct match, while the other two (be they words or labels) come from other topics related to the same document. [063-2] Figure 2. Entities as Labels evaluation interface. [063-3] Figure 3. Label-Mode Evaluation ┌───────────────────────┬──────────────────────┬────────────────────────┐ │Option 1: Agriculture │Option 2: Organization│Option 3: Business cycle│ ├───────────────────────┼──────────────────────┼────────────────────────┤ ├───────────────────────┼──────────────────────┼────────────────────────┤ │A HOME «2 LABEL MODE│ │ │ └───────────────────────┴──────────────────────┴────────────────────────┘ Figure 4. Term-Mode Evaluation