Literary Exploration Machine: A New Tool for Distant Readers of Polish Literature

Maciej Piasecki (maciej.piasecki@pwr.edu.pl), Wrocław University of Science and Technology, Poland
Tomasz Walkowiak (tomasz.walkowiak@pwr.edu.pl), Wrocław University of Science and Technology, Poland
Maciej Maryl (maciej.maryl@ibl.waw.pl), Polish Academy of Sciences, Poland

Brief Summary

This paper presents an initial prototype of a web-based application for textual scholars. The goal of the project is to create a comprehensive and stable research environment that allows scholars to upload the texts they are analysing and either explore them with a suite of dedicated tools or transform them into other formats (text, table, list). The latter functionality is especially important for research on Polish texts, because it allows for further processing with tools built for the English language. The application brings together existing applications developed by CLARIN-PL and supplements them with new functionalities. The project is based on close cooperation between IT professionals, linguists and literary scholars, which ensures that the tools will suit actual researchers' needs. The main features of LEM include lemmatization, part-of-speech tagging, text clustering, semantic text classification based on machine learning together with visualisation of its output, and the generation of custom word lists and lemmatized texts.

Challenge

Digital literary studies seem to be one of the most vividly developing strands of digital humanities. Various analytical systems have been proposed, e.g. Mallet or PhiloLogic3 with PhiloMine, but they focus on selected techniques and mostly on English texts. Their language-processing capabilities are limited to lemmatization and morphosyntactic tagging, and they usually require certain programming skills from their users. In order to address these challenges we have developed a prototype of a web-based system, called the Literary Exploration Machine (LEM), which requires neither installation nor programming skills. LEM has a component-based architecture, remains open to new components, implements natural language processing on different levels and is planned to support several different paradigms of text analysis.

Scheme of the system

Components

Word frequencies can be computed straightforwardly for English, but not for highly inflected languages such as Polish, in which an adjective can have more than 100 possible word forms (however, almost-full sets of distinct forms exist only for some lemmas). In such languages, morphological forms first have to be mapped to lemmas by a morpho-syntactic tagger, e.g. WCRFT2 for Polish (Radziszewski, 2013). By applying different language tools, we can enrich texts with metadata revealing linguistic structures. LEM expands WebSty, an open stylometric system, adopting the following features for text description: segmentation-based (lengths of documents, paragraphs and sentences), morphological (words, punctuation marks, pseudo-suffixes and lemmas), grammatical classes and categories (e.g. from the Polish National Corpus tagset, see Przepiórkowski et al., 2012; Broda and Piasecki, 2013) and their n-grams. This set has been additionally expanded in LEM with the following features, allowing for semantic analysis:
• semantic proper name classes, recognised by the Named Entity Recogniser Liner2 (Marcińczuk et al., 2013),
• temporal and spatial relations (Kocoń and Marcińczuk, 2015) and selected semantic binary relations (e.g. owner of),
• lexical meanings, i.e. synsets in plWordNet (the Polish wordnet), assigned to words and selected multiword expressions by the Word Sense Disambiguation tool WoSeDon (Kędzia et al., 2015),
• generalised lexical meanings, i.e. meanings mapped to more general synsets, e.g. an animal instead of a cheetah (see the sketch after this list),
• lexicographic domains from the wordnet.
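To illustrate how such generalised lexical meanings can be derived, the following minimal sketch walks up a wordnet hypernymy hierarchy. It is not LEM's implementation: it uses Princeton WordNet via NLTK as a stand-in for plWordNet and skips Word Sense Disambiguation, so all senses of a lemma are generalised.

```python
# Minimal sketch of meaning generalisation: map a lemma's synsets to more
# general (hypernym) synsets, e.g. "cheetah" -> "feline"/"carnivore".
# Princeton WordNet via NLTK is used here only as a stand-in for plWordNet;
# run nltk.download("wordnet") once before using it.
from nltk.corpus import wordnet as wn

def generalise(lemma, levels=3):
    """Return the synsets reached by walking `levels` hypernym steps up."""
    generalised = set()
    for syn in wn.synsets(lemma):        # all senses; LEM would use WSD output instead
        current = {syn}
        for _ in range(levels):
            parents = {h for s in current for h in s.hypernyms()}
            current = parents or current  # stop at the top of the hierarchy
        generalised.update(current)
    return generalised

print(generalise("cheetah"))
```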
Rich text description is a good basis for the several processing paradigms that LEM is going to support, namely:
• linguistic text preprocessing: extraction of language data for further statistical analysis, i.e. computing frequencies as the initial feature values, e.g. of lemmas, tags, word senses, etc.,
• topic modelling,
• unsupervised semantic text clustering and analysis of the characteristic features of clusters,
• supervised semantic text classification, trained on manually annotated texts,
• stylometric analysis, performed with the help of the WebSty system.

Processing scheme

The processing paradigms share the following workflow:
1. Uploading a corpus of documents together with metadata in the CMDI format (Broeder et al., 2012) from the CLARIN infrastructure.
2. Text extraction and cleaning.
3. Choosing the features for the description of documents (by users, see Fig. 1).
4. Setting up the parameters for processing (by users).
5. Pre-processing texts with language tools.
6. Calculating feature values for the pre-processed texts.
7. Filtering and/or transforming the original feature values.
8. Data mining.
9. Presenting the results: visualisation or export of data.

To facilitate the upload, users are encouraged to deposit large text collections in the CLARIN-PL dSpace repository. Users are advised to use public licences, but private research corpora can also be uploaded. OCR-ed documents usually contain many language errors that should be corrected to some extent in Step 2. Moreover, metadata elements (e.g. page numbers, headers and footers) have to be separated from the content during this step and stored in a standalone annotation. Users are not expected to have advanced knowledge of Natural Language Engineering or Data Mining; thus, in Step 4, default parameter settings will be provided. More advanced users will be able to tune the tool to their needs (see Fig. 1).

Figure 1. Web interface: a panel with a list of features.

In Step 5 the language tools are run. Each text is analysed by a part-of-speech tagger (e.g. WCRFT2) and then piped to a named entity recogniser (e.g. Liner2, Marcińczuk et al., 2013), temporal expression recognition, word sense recognition (WoSeDon, see Kędzia et al., 2015), etc. Extraction of features encompasses counting frequencies, but also matching annotation patterns at every position in a document. In the case of wordnet-based features, meaning generalisation is done by iterating over the wordnet structure. A dedicated feature extraction module was built; it is similar to Fextor (Broda et al., 2013) but much more efficient, as it supports parallel processing. As a result of Step 6, every document is represented as a vector of feature values and/or a sequence of language elements. The filtering and transformation functions come from the clustering packages or from dedicated systems, e.g. the SuperMatrix system (Broda and Piasecki, 2013). Step 8 differentiates between the processing paradigms. Topic modelling, e.g. with Mallet, takes documents represented as lemma sequences; these can also be processed by corpus tools, e.g. for concordances and frequencies. Documents represented as feature vectors can be processed by clustering systems, e.g. Cluto, or used in machine learning, e.g. with the Weka system.
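As a rough illustration of Steps 6 to 8, the sketch below builds frequency-based feature vectors from already lemmatized documents and clusters them with scikit-learn. It is not LEM's actual implementation: the toy lemma sequences, the tf-idf weighting and the clustering settings are placeholder assumptions.

```python
# Toy example: documents as space-separated lemma sequences, as a tagger such
# as WCRFT2 might output them. Weighting and clustering choices are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

lemmatized_docs = [
    "poeta pisać wiersz o wolność",          # hypothetical lemma sequences
    "krytyk analizować wiersz poeta",
    "powieść opowiadać historia rodzina",
    "narrator powieść opisywać rodzina",
]

# Step 6: calculate feature values (here: tf-idf weighted lemma frequencies)
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(lemmatized_docs)

# Step 8: data mining (here: agglomerative clustering into two groups)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())
print(list(labels))  # cluster label per document, e.g. [0, 0, 1, 1]
```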
Different processing paradigms provide varied perspectives on the data; e.g. topic modelling represents a document in terms of stochastic processes generating word occurrences from topic-related subsets of the text, while clustering reveals groups of documents based on content similarity. It is difficult to find a system that supports all of these paradigms. In LEM, clustering is expanded with the extraction of features characteristic of the individual clusters. Several functions (from the Weka, scikit-learn and SciPy packages), based on mathematical statistics, information theory and machine learning, are offered. The rankings of features are presented on screen for interactive browsing and can be downloaded. WebSty, based on elements of the same framework, can be applied to stylometric analysis.

Step 9, visualisation of the clustering results (see Fig. 4), is based on spectral embedding (also known as Laplacian Eigenmaps). The 3D representation of the data (given as a similarity matrix) is calculated using a spectral decomposition of the graph Laplacian. Texts similar to each other are mapped close to each other in the low-dimensional space, preserving local distances.

Use Case

The LEM prototype was developed by a team working with a particular textual corpus of 2553 Polish texts published in Teksty Drugie, an academic journal dedicated to literary studies. The corpus consisted of two parts: OCR-ed scans (1990-1998) and digital files (1999-2014). Given the aim of this paper (software presentation) and the shortage of space, we treat the results only as examples of the method, without going into too much detail. The work on the prototype was divided into stages, conceived as a feedback loop for the development team: at every stage a new service was added to the application and a test run was performed. After analysing the results, the step was either repeated or the team moved on to the next phase.

Phase 1: Cleaning. The OCR-ed corpus was cleaned (e.g. word breaks and headers were removed).

Phase 2: The corpus was lemmatized and parts of speech were tagged. Frequency lists were created, which enabled the search for patterns in the textual output. For instance, Figure 2 shows the pattern of interest in particular Polish writers throughout 25 years, based on lemmatized mentions.

Figure 2. Pattern of interest in particular Polish writers in Teksty Drugie (1990-2014); the chart ("Polish writers discussed in Teksty Drugie") plots lemmatized mentions of Czesław Miłosz, Witold Gombrowicz, Zbigniew Herbert, Tadeusz Różewicz and Miron Białoszewski.

Phase 3: The analysis of the word frequencies revealed some problems with the word list, especially with numbers, years and city names, which were preserved in bibliographic references. A functionality for applying a custom stopword list was therefore employed. The exclusion of corpus-specific problematic words and generally meaningless words (e.g. a, this, that, if) allowed for the visualisation of the most frequent words in Teksty Drugie (Fig. 3).

Figure 3. 300 most frequent words from Teksty Drugie (1990-2014), meaningless words excluded, visualised with Wordle.

Phase 4: The texts were then grouped into 20, 50 and 100 clusters in a series of experiments. Each grouping revealed a slightly different level of generalisation about the texts. Thanks to its visualisation features (Fig. 4), LEM allows for real-time exploration of deeper relationships between the texts.

Figure 4. Visualisation of the clustering results (weighting: MI-simple, similarity metric: ratio, number of clusters: 20, clustering method: agglomerative, visualisation: the similarity matrix converted to distances and mapped to 3D by a spectral decomposition of the graph Laplacian, i.e. the spectral embedding method).
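The visualisation described in the caption of Figure 4 can be sketched roughly as follows; the similarity matrix here is a made-up placeholder rather than output from the Teksty Drugie corpus, and it is used directly as a graph affinity instead of being converted to distances first.

```python
# Rough sketch of the Step 9 visualisation: a document similarity matrix is
# mapped to 3D with spectral embedding (Laplacian Eigenmaps). The matrix below
# is an invented placeholder; similar texts end up close together in 3D.
import numpy as np
from sklearn.manifold import SpectralEmbedding

similarity = np.array([
    [1.0, 0.8, 0.7, 0.2, 0.1, 0.2],
    [0.8, 1.0, 0.6, 0.1, 0.2, 0.1],
    [0.7, 0.6, 1.0, 0.2, 0.1, 0.2],
    [0.2, 0.1, 0.2, 1.0, 0.7, 0.8],
    [0.1, 0.2, 0.1, 0.7, 1.0, 0.6],
    [0.2, 0.1, 0.2, 0.8, 0.6, 1.0],
])  # symmetric, higher value = more similar documents

# affinity="precomputed" treats the matrix as graph weights; the coordinates
# come from a spectral decomposition of the corresponding graph Laplacian.
coords_3d = SpectralEmbedding(n_components=3, affinity="precomputed").fit_transform(similarity)
print(coords_3d)  # one 3D point per document, ready for interactive plotting
```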
By choosing the level of granularity (20, 50 or 100 clusters), we may analyse diverse patterns of discursive similarity between texts. Table 1 shows the differences in the clustering of the same sample. The first option (20 clusters) shows the similarity between texts on a rather general level, which could be described as stylistic or genre similarity (e.g. formal vocabulary). The other options allow for more detailed exploration of the general research approach (50 clusters) or the particular topics analysed in the articles (100 clusters). The semantics of each cluster is described by the identified characteristic features.

Table 1. Differences between the clustering options (numbers reflect the quantity of texts assigned to a particular cluster).

Researchers may explore all of the options and analyse the vocabulary responsible for classifying particular texts into a certain group by virtue of its being over- or under-represented in comparison to the entire sample.

LEM is not a real-time system; however, processing the example corpus (2553 documents from Teksty Drugie) takes less than 20 minutes. This is due to the use of a private cloud and a proprietary message-oriented engine for processing texts. We plan to speed up the process by running a larger number of instances of the language tools and by compressing the results at each stage. Moreover, the user is able to start processing from any stage, so the processing time is shorter when the user experiments with different settings.

Further Development

Currently, LEM's GUI is being developed in cooperation with potential users: literary scholars working on various types of texts (fiction, journal articles, blog posts). That is also why we call this software "literary": further development will address issues pertinent to literary theory, exceeding a purely linguistic perspective. Some literary-specific issues and functions will be expanded at a later stage of development, e.g. by adding language tools for Word Sense Disambiguation and partial analysis of text structure, such as anaphora resolution and discourse structure recognition. LEM's architecture is open to such extensions. With that said, in this paper we have focused on the current stage of development. LEM will be fully implemented and made available as a web application to the scholarly audience working on Polish. Next, it will be extended with tools for other languages (e.g. English and German). As LEM has a modular architecture, this will mostly require linking new processing Web Services and adding converters. LEM has an open licence, and we will be happy to share our tools, code and know-how with interested teams. Options for exporting to other formats will also be added, so that researchers can easily create output in a particular format (list, text, table) and upload it to other applications (e.g. Mallet) for further processing.