SLaTE: A System for Labeling Topics with Entities

Anne Lauscher

anne@informatik.uni-mannheim.de University of Mannheim, Germany

Federico Nanni

federico@informatik.uni-mannheim.de University of Mannheim, Germany

Simone Paolo Ponzetto

simone@informatik.uni-mannheim.de University of Mannheim, Germany

In recent years, the Latent Dirichlet allocation (LDA) topic model (Blei, Ng,
and Jordan, 2003) has become one of the most widely used text mining techniques
(Meeks and Weingart, 2012) in the digital humanities (DH). Scholars have often
noted its potential for text exploration and distant reading analyses,
although it is well known that its results are difficult to interpret (Chang
et al., 2009) and to evaluate (Wallach et al., 2009).

At last year's edition of the Digital Humanities conference, we introduced a
new corpus exploration method able to produce topics that are easier to
interpret and evaluate than standard LDA topic models (Nanni and Ruiz, 2016).
We did so by combining two existing techniques, namely Entity linking and
Labeled LDA (L-LDA). At its heart, our method first identifies a set of
descriptive labels for the topics of arbitrary documents in a corpus, drawn
from the vocabulary of entities found in wide-coverage knowledge
resources (e.g., Wikipedia, DBpedia). Then it generates a specific topic for
each label. Having a direct relation between topics and labels makes
interpretation easier, and using a disambiguated knowledge resource as
background knowledge limits label ambiguity. As our topics are described with a
limited number of unambiguous labels, they promote interpretability, and this
may sustain the use of the results as quantitative evidence in humanities
research (Lauscher et al., 2016).

The contributions of this poster cover the release of: a) a complete
implementation of the processing pipeline for our entity-based LDA approach; b)
a three-step evaluation platform that enables its extensive quantitative
analysis.

Entity-based Topic Modeling Pipeline

Figure 1 illustrates the computational pipeline of our system; Python classes
are represented as rectangles. First, a set of text files is imported
into the system and several preprocessing steps are applied to
the textual content. Next, the data is sent to the entity
linking system TagMe (Ferragina and Scaiella, 2010), which disambiguates
against Wikipedia. As a result of this step, a set of related Wikipedia
entities is retrieved for each document. The data is then stored in a MySQL
database.

[063-1]

Figure 1. Architecture of the pipeline

Afterwards, the TF-IDF measure is computed over the entities, which we use to
rank all the entities for each document in descending order. Then, the top
k entities as well as their corresponding documents are exported into a
comma-separated values file that is given as input to the L-LDA implementation
of the Stanford Topic Modeling Toolbox. Finally, after running L-LDA and
applying several post-processing steps, we obtain a document-topic distribution
saved in the database in which each topic is described by an unambiguous label
linked to Wikipedia.
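
The ranking and export steps described above can be illustrated with a minimal
sketch. It is not the released implementation: the function names
(top_k_entities, export_csv), the plain tf * log(N/df) weighting, and the
three-column CSV layout (document id, space-separated labels, text) are
assumptions made for illustration only.

```python
import csv
import io
import math
from collections import Counter


def top_k_entities(doc_entities, k):
    """Rank the entities of each document by TF-IDF and keep the top k.

    doc_entities maps a document id to the list of entity names linked in
    that document (repeated mentions appear repeatedly).
    The weighting tf * log(N / df) is an assumed, textbook variant.
    """
    n_docs = len(doc_entities)
    # Document frequency: in how many documents does each entity occur?
    df = Counter()
    for entities in doc_entities.values():
        df.update(set(entities))

    ranked = {}
    for doc_id, entities in doc_entities.items():
        tf = Counter(entities)
        scores = {e: n * math.log(n_docs / df[e]) for e, n in tf.items()}
        # Descending by score; ties are broken alphabetically.
        ordered = sorted(scores.items(), key=lambda p: (-p[1], p[0]))
        ranked[doc_id] = ordered[:k]
    return ranked


def export_csv(ranked, texts, out):
    """Write one row per document: id, labels, text (assumed layout)."""
    writer = csv.writer(out)
    for doc_id, pairs in ranked.items():
        # Spaces inside an entity name are replaced so that labels can be
        # separated by single spaces.
        labels = " ".join(name.replace(" ", "_") for name, _ in pairs)
        writer.writerow([doc_id, labels, texts[doc_id]])


if __name__ == "__main__":
    docs = {
        "d1": ["Agriculture", "Agriculture", "European Union"],
        "d2": ["European Union", "Business cycle"],
    }
    buffer = io.StringIO()
    export_csv(top_k_entities(docs, k=1), {"d1": "text one", "d2": "text two"}, buffer)
    print(buffer.getvalue())
```

An entity that occurs in every document receives a weight of zero under this
weighting, so it is ranked below entities that are specific to a document.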

The complete source code is publicly available on GitHub. Given working
Python, Java, and Scala runtimes as well as a running MySQL installation,
our pipeline works out of the box. It can be configured to the user's needs
through a simple text file.

Three-Step Evaluation Platform

Document Labels

In order to assess the quality of the detected entities as labels we developed
a specific browser-based evaluation platform, which permits manual annotations.
This platform presents a document on the right of the screen and a set of
possible labels on the left (see Figure 2). Annotators are asked to pick labels
that precisely describe the content of each document. In case the annotator
does not select any label, this is also recorded by our evaluation system.

Entities and Topic Words

To establish whether the selected entities were the right labels for the
topics produced, we developed two additional evaluation steps. Inspired by the
topic intrusion task (Chang et al., 2009), we designed a platform for
evaluating the relations between labels and topics in two modes. In the first
(which we call Label Mode, see Figure 3), the annotator is asked to choose,
when possible, the correct list of topic words given a label. In the second
(Term Mode, see Figure 4), the annotator is asked to pick the right label
given a list of topic words. In both cases, the annotator is shown three
options: one of them is the correct match, while the other two (be they words
or labels) come from other topics related to the same document.
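
As a rough illustration of how a single Label Mode question could be
assembled, consider the sketch below. The function name
build_label_mode_item, the dictionary-based input, and the random choice of
the two distractors are assumptions; the platform itself may select and order
the options differently.

```python
import random


def build_label_mode_item(doc_topics, target_label, rng):
    """Build one three-option question for a given label.

    doc_topics maps each label attached to one document to the list of
    topic words of its topic. Returns the label, the three shuffled
    word lists, and the index of the correct one.
    """
    others = [label for label in doc_topics if label != target_label]
    if len(others) < 2:
        raise ValueError("two other topics of the same document are needed")

    # Two distractors drawn from other topics of the same document.
    distractors = rng.sample(others, 2)
    options = [doc_topics[target_label]] + [doc_topics[d] for d in distractors]

    # Shuffle positions while remembering where the correct option lands.
    order = list(range(3))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return target_label, shuffled, order.index(0)


if __name__ == "__main__":
    topics = {
        "Agriculture": ["farm", "crop"],
        "Organization": ["member", "board"],
        "Business cycle": ["recession", "growth"],
    }
    label, options, correct = build_label_mode_item(
        topics, "Agriculture", random.Random(0)
    )
    print(label, options, correct)
```

Term Mode is the mirror image: the question shows one list of topic words and
the three options are labels, with the distractor labels again taken from
other topics of the same document.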

[063-2]

Figure 2. Entities as Labels evaluation interface.

[063-3]

Figure 3. Label-Mode Evaluation

┌───────────────────────┬──────────────────────┬────────────────────────┐
│Option 1: Agriculture  │Option 2: Organization│Option 3: Business cycle│
└───────────────────────┴──────────────────────┴────────────────────────┘

Figure 4. Term-Mode Evaluation