Converted from a Word document
The possibilities for widening the spectrum of research questions by adopting new computational methodology seem to be almost unlimited for literary scholars with considerable programming skills. Researchers with little or no such skills, however, have to rely on user-friendly tools. Simple word counts are still among the most common, and admittedly often very useful features used in computational text analysis. Usually, linguistic annotations are needed for using more complex features in the analysis of style or content of a literary text. For example, a researcher might want to investigate style in terms of syntactic preferences by applying stylometric analysis on part-of-speech tag n-grams, to run topic modelling on specific word types only or to characterize the way an author describes figures by extracting all the adjectives that refer to a named entity. All these examples require of the scholar to first extract linguistic information from the text and use that information to define complex features.
Computer linguists have developed several tools for the various tasks of natural language processing (NLP) that can automatically analyze a digital text and annotate it with such information. In the present spectrum of solutions for NLP tasks, there is a gap between tools for rather simple tasks and full programming frameworks which require significant programming skills. The one end of the spectrum is represented by WebLicht,
DKPro provides a programming framework in which many such NLP tools can be combined into an analysis pipeline. The pipelining approach is especially useful, often even necessary, when one NLP tool needs the annotations provided by another NLP tool in advance for extracting more complex linguistic features. DKPro thus provides access to tools like sentence splitters, tokenizers, part-of-speech taggers, named-entity recognizers, lemmatizers, morphological analyzers and parsers in many languages.
While making NLP significantly easier by integrating many NLP tools into a single framework, the use of DKPro still requires a substantial knowledge of technologies like UIMA, Java and Maven. To further lower the skill threshold for literary scholars to use complex NLP output in computational text analysis, DARIAH-DE (the German branch of the European project Digital Research Infrastructure for the Arts and Humanities, funded by the German Federal Ministry of Education and Research) developed the DARIAH-DKPro-Wrapper (DDW).
Furthermore, the DDW stores its output in a tab-separated plain text format inspired by CoNLL2009.
Scripts connecting the output format to popular text analysis tools like the R package
https://rawgit.com/DARIAH-DE/DARIAH-DKPro-Wrapper/master/doc/tutorial.html
stylo
Digital Humanities 2013: Conference Abstracts. 2013. For the software see:
https://sites.google.com/site/computationalstylistics/stylo
This poster will present the DDW and its file format as a new and comfortable means of providing linguistic annotations, thus significantly lowering the threshold for using complex NLP-based features in computational literary analysis.