Working paper on canonicity and corpus design parameters in the ELTeC context COST Action CA16204 – WG1 2018-09

Unpublished discussion document prepared for COST Action 16204

Extracted from sampling proposal of the COST Action 16204.

Carolin Odebrecht: split up sampling proposal into two documents, added WG3 comments
Sampling criteria for the ELTeC
Outline: Introduction; On Canonicity; Representativeness and balance; Literature
Introduction

The task for WG1 is to develop guidelines for data and metadata for the creation of the ELTeC. This task can be split into several distinct subtasks: guidelines for corpus design, basic annotation and metadata schemes, and workflow. This discussion paper focuses on corpus design and metadata because the two are closely interdependent.

The task for WG3 is to explore theoretical concerns that stem from the application of Distant Reading methods to literary history. WG1's task of designing the corpus guidelines needs to be closely communicated and coordinated with WG3 in order to apply the relevant textual, paratextual, and contextual genre markers of the novel. The role of WG3 is crucial in formulating the relevant research questions due to its literary-historical and comparative expertise.

This document is a joint working paper of WG1 and WG3 on canonicity and corpus design parameters in the context of the ELTeC.

On Canonicity

A canon is a portrait of someone’s prestigious social, cultural and economic status, and it reflects normative, self-promotional, legitimating and rating decisions. In contrast, a corpus design follows a research question or context and is therefore more oriented towards a research goal. This makes it paramount that the hypotheses and research questions be clearly defined. A second important aspect is how the actual texts are considered. As Moisl (2009: 876) puts it, ‘Data is ontologically different from the world.’ There is thus a difference between texts in the world and the data we create. By ‘text’ we may mean the manifestation, the extension or the work (cf. IFLA 2009). A canon can contain an extension of a certain text which is available in different languages and prints. Ontologically, these levels of text differ from what a text in a corpus might be (cf. van Zundert and Andrews 2017). This means that digitization is a kind of annotation, and hence of interpretation (Odebrecht et al. 2017): a representation of a text in a corpus (e.g. a transcription or OCR output) is the result of interpretation. A corpus design needs to take this into account when dealing with sampling and digitization.

At the end of the Action, ELTeC should contain literary texts (novels) from a distinct period and in several languages. For each language (and thus for each cultural context) there exists a diversity of canons which reflect different, changing, historical perspectives on the notion of both the canon (either as national or as part of world literature) and its counter-canon(s). Each canon is the result of rating texts from a particular perspective. The assessment can reflect an intellectual rating (a text is representative of a certain literary period, genre or subgenre, is influential, is important), an economic rating (a text is published in more than one print run), or a readers' rating (a text is most popular within a certain group of readers) (cf. Herrmann 2011 or Winko 1996). All these ratings can change over time and may also interact with each other. A canon can therefore reflect different interpretations of ‘famous’, ‘important’ or ‘influential’ texts. These criteria are not comparable across languages and contexts. For example, texts from a smaller language community such as Czech are less likely to be frequently reprinted than English texts of the same period and genre. Additionally, the international visibility of particular texts often depends on the socio-cultural influence of the country and its publishing houses.

The criteria derived from a canon are neither fully comparable nor categorical. Which prestigious group’s canon should be considered, which should be excluded, and why? Are there comparable canons for novels in all countries of the language in question? These questions echo the debates about world literature as the circulation of texts from centres to peripheries, and chime with the related notion of world literature as a canon of universal masterpieces, both of which deserve critical examination. Algee-Hewitt and McGurl (2015) present an approach to corpus design based on several canons and show the kinds of problems that arise. Each analysis of such a corpus then only shows the effects of the decisions made by the normative group. Considering national canons is also difficult and somewhat problematic. An example is the German national canon of literature, which was developed in the 18th century and was promoted by the national educational system until 1990. Since German reunification, the educational system has not promoted a strict canon and does not recommend a list of books to be read in school or at university (cf. Winko 1996). Thus, taking such canons as part of a sampling base for ELTeC would mean reflecting the political and social past of German education and politics. Choosing between canons can then mean choosing between contemporary tastes (those of readers at the time a text appeared) and retrospective tastes (those of a canon builder rating historical texts). Finally, these canons were not built to serve as sampling guidelines for the kind of corpus we would like to build in the Action.

The MoU of CA16204 formulates the goal as follows: “The main aim and objective of the Action is to develop the resources and methods necessary to change the way European literary history is written.” This goal requires a new approach to corpus design, metadata design and annotation models. As Fowler (2002, 214) puts it: ‘The current canon sets limits to our understanding of literature, in several ways’. Relying on canons would obstruct the Action’s goal in a fundamental way. Canons provide traditional and normative access to the history of literature, whereas the Action focuses on new approaches in order to tell another story. Instead of relying on canons, we might decide that our collection should contain a mixture of works that have never been reprinted since their first appearance, works that have been reprinted a small number of times within one or two decades of their first appearance, and works that have been reprinted in almost every decade since their first appearance.

Therefore, we argue for a non-normative, metadata-based set of sampling criteria that follows a corpus design approach. Corpus sampling criteria are mostly shaped by the research question and/or the context of the group creating the corpus. In CA16204, we have neither a distinct research question nor a fixed and previously known group of corpus creators. The research context of the Action is more interested in knowledge production in a methodological sense and does not prefer a single method, model or theory. Furthermore, the membership of the Action will fluctuate and consist of researchers from different disciplines with different theoretical and cultural backgrounds. Thus, we need to build the corpus design on a methodological basis. This will enable us to select a certain number of canonical texts as well, but also to be more open and inclusive than mainstream literary histories.

Representativeness and balance

In addition to the aspect of ‘prestige’ (canon), the aspect of representativeness is problematic for corpus design. Developing criteria for corpus design means deciding which sample of the world is to be included in the database. Obviously, including the whole population of 19th-century literature in several languages is impossible. So we need to make a compromise between what we would like to have in the corpus (all literature) and what we can put in the corpus (a sample). The biggest challenge in this process of selection will be the availability of digitised texts, since the project does not provide funding for new digitisation projects.

It is a truism that there is no such thing as a ‘good’ or a ‘bad’ corpus, because how a corpus is designed depends on what kind of corpus it is and how it is going to be used. (Hunston 2008, 155)

Following Hunston (2008), a corpus design needs to follow the research goal. For the Action, representativeness may be understood as the relationship between the corpus and the body of literature in question. ‘Representativeness refers to the extent to which a sample includes the full range of variability in a population’ (Biber 1993, 243). To say something about the representativeness of a corpus requires knowledge about the whole population of literature (in the period in question). In practice, we do not know every book in every language that was published, read or discussed in Europe in the period in question. It is further ‘impossible to identify a complete list of ‘categories’ that would exhaustively account for all texts produced in a given language’ (Hunston 2008, 161) or context. Such categories can refer to various factors such as characteristics of authors (e.g. gender or place of birth), publishers, topics of the texts, readership etc. Against the background of canon building, there is also ‘no true measure of the ‘significance’ of a type of discourse to a community’. The chance that a corpus represents the whole population of something increases with the size of the corpus. In this way, size and representativeness are connected.

See Biber (1993) and Hunston (2008) for a detailed discussion.

Representativeness is therefore a kind of ideal which we would like to pursue but which cannot be achieved in full. In line with the MoU, the ELTeC can be designed as a monitor corpus to which texts (from different languages and periods) can be added over time.

Balance refers to the internal proportions of the corpus. Note that a fully balanced corpus is an ideal which we can only try to approach, alongside the ideal of representativeness (Hunston 2008, 163). According to the MoU, the corpus shall contain 2,500 full-text novels in at least 10 different languages:

Languages: Dutch, English, French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish (ELTeC core)

First iteration: 6 subcollections (100 novels per language), 1840 to 1920, starting with British, French, Spanish, German, Greek, Polish

Second iteration: 4 subcollections (100 novels per language), 1840 to 1920

Third iteration: 6 subcollections in additional languages, and subcollections for all 16 languages covering 1780 to 1839

We will also try to include additional languages such as Hungarian, Serbian, Swiss, Romanian, Czech, Latvian, Norwegian.

In this way, ELTeC is balanced with respect to language. With respect to genre, the corpus is not balanced but homogeneous; all texts in the corpus shall be full novels. With respect to time, the corpus design shall focus on the period 1840 to 1920.

Before discussing the criteria in more detail, we would like to raise another methodological question concerning corpus sampling: should we apply each criterion with the intention of representing the variety of its possible values, or should the sample represent the distribution of those values across the population?

Let’s say we wish to select 100 texts from a population of texts published over a period of (say) 20 decades. We might select five texts from the first decade, five from the second, and so on, making up our 100 titles, evenly spread across the possible decades. The probability that a text in our corpus comes from any given decade will always be the same: 5 in 100, or 1 in 20. This selection represents the variety of possible values for the criterion. Suppose now that we look more closely at the number of titles from each decade actually available in the population we are sampling. It is more than likely that this number will vary significantly: for example, we might notice that there are 2000 titles published in decade x, and only 100 in decade y. To represent this population statistically we should therefore make it 20 times more probable that a randomly chosen title will come from decade x than from decade y. Since the total number of titles we can choose is quite small relative to the total number available in the population, strict application of this principle may mean that we cannot choose any titles at all from some decades. This is one reason for preferring to make our sampling represent variety rather than frequency; another is that we cannot choose fractional numbers of titles. When we start considering more than one criterion, the task of ensuring that the numbers in our sample accurately reflect the distribution of all values across the population becomes prohibitively complex.
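The contrast between the two strategies can be made concrete with a small calculation. The following Python sketch is an illustration only: the function names and the population counts (2000 titles in decade x, 100 in decade y, and an assumed 500 in each of the other 18 decades) are hypothetical and are not taken from any ELTeC data.

```python
from fractions import Fraction


def equal_allocation(sample_size, strata):
    """Same number of titles from every stratum (represents variety)."""
    per_stratum, remainder = divmod(sample_size, len(strata))
    if remainder:
        raise ValueError("sample size is not divisible by the number of strata")
    return {name: per_stratum for name in strata}


def proportional_allocation(sample_size, population):
    """Exact quota per stratum, proportional to its share of the population
    (represents frequency). Quotas are kept as fractions, not rounded."""
    total = sum(population.values())
    return {name: Fraction(sample_size * count, total)
            for name, count in population.items()}


# Hypothetical population: 20 decades, invented counts for illustration.
population = {f"decade_{i:02d}": 500 for i in range(1, 19)}
population["decade_x"] = 2000
population["decade_y"] = 100

equal = equal_allocation(100, population)
proportional = proportional_allocation(100, population)

# Equal allocation: 5 titles out of 100 from every decade.
share_per_decade = Fraction(equal["decade_x"], 100)

# Proportional allocation: decade x is 20 times as likely as decade y,
# but the quota for decade y is a fraction below one title.
ratio_x_to_y = proportional["decade_x"] / proportional["decade_y"]
quota_y = proportional["decade_y"]

print(share_per_decade, ratio_x_to_y, float(quota_y))
```

Under these assumed counts the proportional quota for decade y is less than one title, so rounding down would drop that decade from the sample entirely, and every other quota is fractional as well.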

Following the approach of representing the variety of a population, we then need to decide which criterion is to be balanced in which way and how it interacts with other criteria. For example, we may want to choose novels by male and female authors in a balanced way. This may mean that, in the total of all novels, one half will be by female authors. Without any further regulation, we might have more female authors in one decade than in others. If we would like to have an equal number of male and female authors in every decade, we need to link the criterion of the author’s gender with the criterion of time. Doing this might complicate the selection process (e.g. finding novels to fill this proportion in every decade of the period in question); it may also distort the reflection of the changing share of women authors among novelists, which is in itself important for the history of the genre. So we have to decide which categories shall be present in a balanced way in the corpus.
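Linking two criteria amounts to setting a quota for every combination of their values and then checking whether enough candidate texts exist to fill each combination. The Python sketch below illustrates this under stated assumptions: the field names `decade` and `gender`, the candidate counts, and the sample size of 96 (chosen only because it divides evenly into 8 decades times 2 genders) are all hypothetical.

```python
from collections import Counter
from itertools import product


def linked_quotas(sample_size, decades, genders):
    """Equal quota for every (decade, gender) cell, linking both criteria."""
    cells = list(product(decades, genders))
    per_cell, remainder = divmod(sample_size, len(cells))
    if remainder:
        raise ValueError("sample size is not divisible by the number of cells")
    return {cell: per_cell for cell in cells}


def shortfalls(candidates, quotas):
    """Cells where fewer candidate novels are available than the quota needs."""
    available = Counter((c["decade"], c["gender"]) for c in candidates)
    return {cell: quota - available[cell]
            for cell, quota in quotas.items()
            if available[cell] < quota}


# Hypothetical candidate pool; all counts are invented for illustration.
decades = [1840, 1850, 1860, 1870, 1880, 1890, 1900, 1910]
genders = ["female", "male"]
candidates = (
    [{"decade": d, "gender": "male"} for d in decades for _ in range(20)]
    + [{"decade": d, "gender": "female"} for d in decades[4:] for _ in range(10)]
    + [{"decade": d, "gender": "female"} for d in decades[:4] for _ in range(2)]
)

quotas = linked_quotas(96, decades, genders)  # 6 novels per cell
missing = shortfalls(candidates, quotas)

for cell, gap in sorted(missing.items()):
    print(cell, "short by", gap)
```

In this invented pool the four earliest decades each lack four novels by female authors, which is the kind of selection difficulty the linked criterion introduces. A sample size of 100 would not divide evenly across the 16 cells at all, which again shows that fractional quotas are unavoidable once criteria are combined.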

Literature

Algee-Hewitt, Mark; McGurl, Mark (2015): Between Canon and Corpus. Six Perspectives on 20th-Century Novels. Stanford Literary Lab Pamphlet no. 8.

Biber, Douglas (1993): Representativeness in Corpus Design. In: Literary and Linguistic Computing (8), pp. 243–257.

Herrmann, Leonhard (2011): System? Kanon? Epoche? In: Matthias Beilein, Claudia Stockinger and Simone Winko (eds.): Kanon, Wertung und Vermittlung. Literatur in der Wissensgesellschaft. Berlin: De Gruyter (Studien und Texte zur Sozialgeschichte der Literatur, vol. 129), pp. 59–75.

Hunston, Susan (2008): Collection strategies and design decisions. In: Anke Lüdeling and Merja Kytö (eds.): Corpus Linguistics. An International Handbook. 2 vols. Berlin: De Gruyter (1), pp. 154–168.

IFLA (2009): Functional Requirements for Bibliographic Records (Technical Report). Available online at http://www.ifla.org/publications/functional-requirements-for-bibliographic-records, last accessed 23.12.2016.

Lüdeling, Anke (2011): Corpora in Linguistics. Sampling and Annotation. In: Karl Grandin (ed.): Going Digital. Evolutionary and Revolutionary Aspects of Digitization. New York: Science History Publications (Nobel Symposium, 147), pp. 220–243.

Moisl, Hermann (2009): Exploratory Multivariate Analysis. In: Anke Lüdeling and Merja Kytö (eds.): Corpus Linguistics. An International Handbook. 2 vols. Berlin: De Gruyter (2), pp. 874–899.

Winko, Simone (1996): Literarische Wertung und Kanonbildung. In: H. L. Arnold and H. Detering (eds.): Grundzüge der Literaturwissenschaft. München, pp. 585–600.

van Zundert, Joris; Andrews, Tara L. (2017): Qu'est-ce qu'un texte numérique? A new rationale for the digital representation of text. In: Digital Scholarship in the Humanities (32), pp. 78–88. DOI: 10.1093/llc/fqx039.