From Jane Austen's original Pride and Prejudice to a graded reader for
L2 learners: a computational study of the processes of text simplification

Emily Franzini

efranzini@etrap.eu

eTRAP Research Group

University of Goettingen, Germany

Marco Buchler

mbuechler@etrap.eu

eTRAP Research Group

University of Goettingen, Germany

Introduction

Authentic text and graded reader

One of the objectives of second language (L2) learning is to be able to read
and understand a variety of texts, from novels to newspaper articles, written
in the language of interest. These texts written with a native audience in mind
are commonly referred to as authentic texts or ”real life texts, not written
for pedagogic purposes” (Wallace, 1992). Authentic texts, however, can present
too many obstacles for L2 learners with too low a level of knowledge.
The complex language structures and advanced vocabulary of these ‘real’ texts
can have the unwanted effect of demotivating the reader (Richard, 2001).
The gap between the learner’s limited L2 knowledge and the fluency of authentic
texts creates an ideal space for graded readers. Graded readers are
’’simplified books written at varying levels of difficulty for second language
learners” (Waring, 2012). Through graded readers original classic works can be
adapted to match the learner’s level of knowledge, thus providing the ideal
tool to tackle ‘real’ themes, narratives and dialogues.

From authentic text to graded reader

One such graded reader is a newly adapted version of Jane Austen’s Pride and
Prejudice (edition of 1813) that one of the authors of this paper wrote
(Franzini, 2016) as part of a collection for learners of English as a foreign
language (EFL).

For authors, the process of adaptation of a text for a learning audience is
complex. In order to simplify the text the author will necessarily have to
make grammatical changes and lexical substitutions following vocabulary lists,
shorten the text by cutting out entire paragraphs and events, and in some
cases eliminate entire chapters and characters. Together with these changes,
which can be defined as ‘structural’ because they are dictated by
hard requirements of length and standardised level of difficulty, the author
will also make a series of judgment calls at a sentence and word level.
These changes, which are here defined as ‘cognitive’, include processes that
are more intangible and that are a consequence of a native author’s ‘feeling’
that the original text is too difficult for learners. These
include elaborating, clarifying, providing context and motivation for
unfamiliar information and non-explicit connections (Beck et al., 1991).

Research Objective

The objective of this study is to computationally analyse the manual process
behind the simplification of a historical authentic text aimed at producing
a graded reader. More specifically, it aims to classify and understand the
structural and cognitive processes of adaptation that a human author, more or
less consciously, is able to perform manually. Do the applied changes follow
strict rules? Can they be classified as forming a pattern? And if so, can they
be reproduced computationally?

Related Research

Researchers have long been addressing the issue of text simplification for a
variety of purposes. A similar study to this was made by Petersen who
compared authentic newspaper articles with abridged versions (Petersen and
Ostendorf, 1991). Similar studies have been made, for example, to create a
reading aid for people with disabilities (Canning, 2000, Allen, 2009). Data

This study considers two sets of data. The first is the entire original novel
(ON) Pride and Prejudice. The second dataset the graded reader (GR) published
by Liberty. The GR has been compressed from the 61 chapters of the ON to 10
chapters. When comparing word tokens, the GR has a size of 12.6% of the ON
(Tab. 1). The language was simplified to match the upper intermediate level B2.
To guide the choice of vocabulary, the author chose to follow the
Lexitronics Syllabus (Lexitronics, 2009).

┌───────────────┬─────────────┬───────────────┬─────────────┬─────────────────┐
│Original Novel │             │               │             │Average sentence │
│               │Line count   │Word tokens    │Word types   │length 24.00     │
│Graded Reader  │5,974 1,115  │143,386 18,086 │6,823 1,813  │                 │
│               │             │               │             │16.22            │
├───────────────┼─────────────┼───────────────┼─────────────┼─────────────────┤
│% GR size in   │18.6%        │12.6%          │26.5%        │67.5%            │
│respect to ON  │             │               │             │                 │
└───────────────┴─────────────┴───────────────┴─────────────┴─────────────────┘

Table 1: Quantitative comparison between data sets

Methodology

Readability

As a first step towards analysing the differences and similarities between an
authentic text and a graded reader, we decided to evaluate if what is published
as a graded reader can computationally be considered a simplified version of
the original. The method chosen to make this investigation was to conduct two
different readability tests, namely the ARI test and the Dale-Chall Index test
on the data. Both tests were designed to gauge the comprehension difficulty of
a text by providing a numeric value, which corresponds to a particular school
level of a native speaker of the language tested.

The results show that both tests yield similar scores and satisfy the
hypothesis that this particular GR can be computationally proven to be, in
terms of ‘understandability’, a simplification of the ON.

┌─────────────────────┬───────────────────────┬───────────────────────────────┐
│                     │ARI                    │                               │
│Original Novel Graded│                       │Dale-Chall 14-16 year          │
│Reader               │14-15 year olds 11-12  │olds 11-13 year olds           │
│                     │year olds              │                               │
└─────────────────────┴───────────────────────┴───────────────────────────────┘

Table 2: Age level of text understandability

Difference Analysis

In order to analyse the process of adaptation, a difference analysis was
conducted by considering both those elements that changed from the ON to the
GR, and those that, by contrast, remained the same. The analysis is structured
into chapters, sentences and words, so as to proceed in order from the largest
unit of text to the smallest.

When adapting a text, whether it is for a graded reader, a play or a film, the
rationale behind the selection of certain parts over others is
normally content-based. Here the author selected the most dynamic parts of the
novel, which included dialogues, moments of suspense, movements of the
characters and revelations. The selection of some scenes of the plot over
others is purely a 'cognitive' choice of the author because it is entirely
subjective. However, by using a text reuse detection software (TRACER) on both
texts it was possible to visualise where the majority of reuses occur. These
concentrate in particular around the beginning and the end of the novel (dark
green in Fig. 1).

«8»«

[301-1]


Figure 1: Dotplot visualisation of the reuses between the ON and the GR. The
longer X-axis represents the larger original novel, the Y-axis the smaller GR.
The darker the dot, the closer the similarities between the two datasets

‘Structural’ changes made at a sentence level present patterns that can be more
systematically identified. For example, by comparing sentence length, it was
noted that on average the ON contains longer sentences (24 words) than the GR
(16.22 words) (Fig. 2). Though this might seem like an obvious result,
it appears less so when one thinks that, in order to simplify a concept for a
language learner, it is often necessary to use additional words to elaborate or

[301-2]

Figure 2: Sentence length distribution. The X-axis represents the number of
words per sentence; the Y-axis is the probability of sentences of a specific
length occurring in

the texts

In order to conduct a difference analysis on the smallest unit of text - the
word - we looked at all the words that appear frequently in the ON, but that
never appear in the GR, in order to understand what kinds of words the author
found necessary to drop.

┌────────────┬─────────┬────────────┬─────────┐
│Word        │Frequency│Word        │Frequency│
├────────────┼─────────┼────────────┼─────────┤
│upon        │75       │table       │31       │
├────────────┼─────────┼────────────┼─────────┤
│least       │65       │astonishment│30       │
├────────────┼─────────┼────────────┼─────────┤
│acquaintance│63       │fancy       │30       │
├────────────┼─────────┼────────────┼─────────┤
│either      │59       │attempt     │29       │
├────────────┼─────────┼────────────┼─────────┤
│whose       │59       │dine        │29       │
├────────────┼─────────┼────────────┼─────────┤
│dare        │53       │beg         │28       │
├────────────┼─────────┼────────────┼─────────┤
│regard      │53       │depend      │28       │
├────────────┼─────────┼────────────┼─────────┤
│determine   │47       │highly      │28       │
├────────────┼─────────┼────────────┼─────────┤
│scarcely    │45       │satisfaction│28       │
├────────────┼─────────┼────────────┼─────────┤
│ladyship    │42       │acknowledge │27       │
├────────────┼─────────┼────────────┼─────────┤
│former      │38       │credit      │27       │
├────────────┼─────────┼────────────┼─────────┤
│put         │36       │thus        │27       │
├────────────┼─────────┼────────────┼─────────┤
│amiable     │35       │disposition │26       │
├────────────┼─────────┼────────────┼─────────┤
│deal        │34       │exceedingly │26       │
├────────────┼─────────┼────────────┼─────────┤
│design      │32       │praise      │26       │
├────────────┼─────────┼────────────┼─────────┤
│satisfy     │32       │pray        │26       │
├────────────┼─────────┼────────────┼─────────┤
│society     │32       │wholly      │26       │
└────────────┴─────────┴────────────┴─────────┘

Table 3: Words that appear only in the ON

Table (3) shows that 14 out of the 34 words listed (ca. 35%) are too advanced
for level B2. Some of the other words, though accessible to B2 learners, were
replaced with easier synonyms. We also conducted an analysis on parts of speech
and how they differ in the two data sets (Tab. 4).

┌──────────────────────────┬─────────────────┬──────────────┬─────────────────┐
│PoS                       │More frequent in │Similar       │More frequent in │
│                          │ON               │frquenev      │GR               │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│JJS adjective, superlative│X                │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│JJR adjective, comparative│X                │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│PDT predeterminer         │X                │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│KBS adverb, superlative   │X                │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│WDT WH-dctcrminer         │X                │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│FW foreign word           │X                │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│: colon                   │X                │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│WPS WH-pronoun, posses-   │X                │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│sive                      │                 │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│NNPS noun, proper, plural │X                │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│SYM symbol                │X                │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│RP particle               │                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│RB adverb                 │                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│VB verb, base form        │                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│TO ’to’ as preposition    │                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│JJ adjective or numeral,  │                 │X             │                 │
│ordi-                     │                 │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│nal                       │                 │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│NNS noun, proper, singular│                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│CC conjunction,           │                 │X             │                 │
│coordinating              │                 │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│PRPS prounoun, possessive │                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│NN noun, common, singular │                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│MD modal auxiliary        │                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│IN preposition or         │                 │X             │                 │
│conjuction.               │                 │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│subordinating             │                 │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│DT determiner             │                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│VBN verb, past participle │                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│VBG verb, present         │                 │X             │                 │
│participle                │                 │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│POS genitive marker       │                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│RBR adverb, comparative   │                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│EX existential 'there'    │                 │X             │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│UH interjection           │                 │              │X                │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│NNP noun, proper, plural  │                 │              │X                │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│WRB WH-advcrb             │                 │              │X                │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│VBD verb, past tense      │                 │              │X                │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│VBP verb, present tense,  │                 │              │X                │
│not                       │                 │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│3rd person singular       │                 │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│VBZ verb, present tense.  │                 │              │X                │
│3rd                       │                 │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│person singular           │                 │              │                 │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│WP WH-pronoun             │                 │              │X                │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│CD numeral, cardinal      │                 │              │X                │
├──────────────────────────┼─────────────────┼──────────────┼─────────────────┤
│PRP prounoun, personal    │                 │              │X                │
└──────────────────────────┴─────────────────┴──────────────┴─────────────────┘

Table 4: Parts of speech frequency in the ON vs. in the GR. Note the presence
of comparative and superlative adjectives in the ON, which are totally absent
from the GR

Conclusions and further research

This study is a first step into the realm of text simplification and adaptation
regarding graded readers for L2 learners. By conducting a difference analysis
between the two texts, it was observed that at plot level the selection of
scenes has no impact on the difficulty of a text. The text reuse detection
software used, however, identified which parts of the plot have been preserved
and which have been eliminated for the sake of a consistent, yet shorter, story
line. It was observed that the beginning and the end of the novel were the
parts that were adapted most faithfully.

The identification of reuse over the whole novel was also a step towards
pinpointing where sentences were reused verbatim and where they were not. Where
the sentences have undergone heavy changes, we can observe to what extent they
were modified, how and why. At a sentence level, we noted that reducing the
length of the sentences is a successful simplification strategy. A further
study would have to be conducted to best understand how sentences were split or
reduced, and consequently how the syntax of a sentence was affected by its
shortening.

At a word level, the simplification of the text appeared to be dictated by the
elimination and replacement of difficult vocabulary and certain parts of
speech, such as comparative and superlative adjectives. The word length does
not appear to be an indicator of difficulty. While it was observed that
both the readability tests were based on sentence length as a parameter, only
the ARI test, however, considers word length as another parameter. A test on
the word-length distribution of the ON versus the GR shows that, in this case,
the word length bears no importance in assessing the difficulty of a text.
Further research would have to be conducted in order to learn if it is easier
for an L2 learner to remember a word not because of its length, but because of
its repeated presence in a text. The insights gained from this study will be
useful in future work on automating the simplification process.