Unpublished draft for discussion
A born digital document drafted in TEI format by LB
This reference document defines the encoding scheme to be used for the European Literary Text Collection (ELTeC) which will be a major deliverable of COST Action 16204,
Open Question.
The MoU for the project points out that Distant Reading methods cover a wide range
of computational methods for literary text analysis, such as authorship
attribution, topic modelling, character network analysis, or stylistic
analysis.
The focus of the ELTeC encoding scheme is thus not to represent
texts in all their original complexity of structure or appearance, but rather to
facilitate a richer and better-informed distant reading than a transcription of their
lexical content alone would permit. For example, it seems useful to distinguish
headings and annotations from the rest of the text, and to be able to locate
stretches of text within gross structural features such as pages, chapters, or paragraphs.
Although it may be useful to distinguish passages belonging to different narrative
levels (for example, direct speech versus narrative or quotation versus narrative),
it is difficult to do so automatically with any degree of consistency.
It is certainly less useful to record
exact nuances of rendition or spelling in a particular version of a text. Our goal is
thus not to duplicate the work of scholarly editors or to produce (yet another) digital
edition of a specific source document. Rather, it is to ensure that the ELTeC texts can be
processed by very simple-minded (but XML-aware) systems primarily concerned with
lexis, and to make life easier for the developers of such systems.
In selecting features for inclusion in the markup scheme, we have been guided, but not limited, by existing practice as far as possible. Our main goal has been to identify a small core set of textual features which can be readily (preferably automatically) identified in existing digital transcriptions, or easily and consistently provided by new transcriptions.
We distinguish three
This document lists all the textual features which are to be distinguished in an ELTeC-conformant transcription at one of these three levels. Whenever a given feature exists in a text, it will be marked up as indicated here. No other features will be captured by the markup: if a marked-up source text identifies some textual feature not provided for here, that markup will be removed (though it may be retained in a version of the text encoded at a different level).
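The removal of unlisted markup could be sketched roughly as follows. This is an illustration only, not part of the scheme: the `ALLOWED` set and the function names are hypothetical, the real list of permitted elements is whatever this document finally specifies, and the sketch assumes that "removing markup" means dropping the tags while keeping the text they contain. It also ignores XML namespaces, which real TEI documents use.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist, for illustration only.
ALLOWED = {"text", "body", "div", "head", "p"}


def _append_text(parent, index, text):
    """Attach `text` immediately before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def unwrap_unlisted(parent, allowed=ALLOWED):
    """Remove the tags of every descendant not in `allowed`, keeping its content.

    The element passed in is itself left untouched; only descendants are checked.
    """
    i = 0
    while i < len(parent):
        child = parent[i]
        unwrap_unlisted(child, allowed)  # clean the subtree first
        if child.tag in allowed:
            i += 1
            continue
        # Splice the child's text, children, and tail into the parent.
        parent.remove(child)
        _append_text(parent, i, child.text)
        grandchildren = list(child)
        for offset, grandchild in enumerate(grandchildren):
            parent.insert(i + offset, grandchild)
        i += len(grandchildren)
        _append_text(parent, i, child.tail)
    return parent


if __name__ == "__main__":
    source = '<p>He said <hi rend="italic">never</hi> again.</p>'
    cleaned = unwrap_unlisted(ET.fromstring(source))
    print(ET.tostring(cleaned, encoding="unicode"))
```

Unwrapping rather than deleting is a design choice made here so that no lexical content is lost; a version encoded at a different level could simply use a larger whitelist.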
All ELTeC documents are TEI conformant, and therefore include a TEI Header, as discussed
in section
The basic unit of the ELTeC corpus is the text of a single novel, represented by a
TEI
To facilitate checking of a transcription against its source during
production, the
As well as a titlepage or a table of contents, a published novel often includes material such as forewords or appendixes in addition
to the text of the novel itself. This liminal
.
At
level zero, titlepages and tables of contents are omitted. At level one, they are
replaced by a
The Prague decision list says that we decided to exclude titlepages, tables of contents, errata lists etc., but to include prefaces, introductions, afterwords, and appendixes, provided these are contemporary with the text. It also says to include footnotes and commentary, but does not specify whether these should also be contemporary, nor how the encoder can easily determine whether or not something is "contemporary".
Within the body of a text, major structural divisions (parts, sections, chapters
etc.) will be captured using the generic
The names used for hierarchic structural divisions of a novel above the chapter are
arbitrary, culture-specific, and often inconsistent: in some novels things called
"part" contain things called "book", and in others the reverse. We
propose to follow TEI in using a single element (chapter
.
Is it useful to retain the name used for each level in
the original source (the type of div)?
level1
, level2
etc.)
This issue was not discussed in Prague. Proposal is to use (and enforce) a predefined list of specific values.
The (human) language in which a text is expressed is indicated explicitly by the
Should passages exhibiting regional or dialectal
variation be specially signalled?
In Prague there was some support for using either
A single reference scheme will be defined for the whole corpus, with the following
components:
The identifier will be supplied as the value of an FR042
FR042012
is the twelfth chapter of the 42nd French novel.
Note that these identifiers will not necessarily correspond with the numbering used
in a particular source text. In a work where the first twelve chapters are considered
to form part one, and the next twelve constitute part two, the first chapter of the
second part will have an identifier ending "013", even though it may be
numbered "1" in a source text.
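The numbering behaviour described above can be illustrated with a small sketch. The identifier layout assumed here (a two-letter language code, a three-digit novel number, and a three-digit chapter number) is inferred only from the example FR042012 and is not confirmed by this document; the names `ChapterRef`, `parse_chapter_id`, and `sequential_chapter_ids` are hypothetical.

```python
import re
from typing import NamedTuple


class ChapterRef(NamedTuple):
    language: str
    novel: int
    chapter: int


# Assumed layout: LL + NNN + CCC, e.g. FR042012 (inferred from the example only).
_ID_PATTERN = re.compile(r"^([A-Z]{2})(\d{3})(\d{3})$")


def parse_chapter_id(identifier):
    """Split an identifier such as 'FR042012' into its assumed components."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a chapter identifier: {identifier!r}")
    language, novel, chapter = match.groups()
    return ChapterRef(language, int(novel), int(chapter))


def sequential_chapter_ids(language, novel, part_sizes):
    """Number chapters continuously across parts, ignoring source numbering.

    `part_sizes` lists how many chapters each part of the novel contains.
    """
    identifiers = []
    chapter = 0
    for size in part_sizes:
        for _ in range(size):
            chapter += 1
            identifiers.append(f"{language}{novel:03d}{chapter:03d}")
    return identifiers


if __name__ == "__main__":
    print(parse_chapter_id("FR042012"))
    ids = sequential_chapter_ids("FR", 42, [12, 12])
    print(ids[12])  # first chapter of part two
```

With two parts of twelve chapters each, the thirteenth identifier generated is the first chapter of part two, whatever number it carries in the source.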
No dissent from this proposal in Prague.
Is it important to preserve the original numbering,
particularly for deeply structured texts?
Not explicitly addressed in Prague. The proposal is not to retain original numbering.
The chapters of a novel mostly consist of prose, arranged in paragraphs, for which we
will use the TEI
How should material other than running prose and
dialogue be encoded?
In Prague, we decided to suppress annotation of linebreaks, lists,
tables, figures, captions of figures, typographic information, pagebreaks,
and quotation (i.e. direct/indirect speech). We explicitly agreed to annotate
only paragraphs, divisions, and headings. Other features would be represented either
by a
Novels are also full of direct speech, represented using various different
conventions, but almost always distinguished from the narrative voice. First-person
narrative is also common, but may be regarded as a special case.
How exactly different narrative strands are articulated in a novel, and the extent
to which they may be characterised by their lexis, has been a preoccupation of many
analyses in the distant reading style. It might therefore be helpful to distinguish material
purporting to be direct speech from material purporting to be narrative in our basic
encoding, though to do so consistently and accurately may occasionally be
problematic.
Should passages presented as direct speech in a novel be
distinguished from passages presented as narrative?
In Prague the majority view was not to attempt to do more than preserve existing punctuation.
Printed texts typically deploy a number of conventions which can cause problems for linguistic analyses of even the most basic kind. Changes of font or style (italicization or use of superscript, for example) can have particular lexical significance which should be taken into account. End-of-line hyphenation can make it harder to identify the exact form of a token. Non-standard (i.e. non-modern) spellings can mislead parsers. Our proposed encoding aims above all for consistency and transparency in what is reliably achievable, leaving more difficult and problematic issues to be addressed by linguistic annotations.
We do not preserve the lineation of running prose in our source texts, since this is always purely an artefact of the source edition. For the same reason we will reassemble words broken across a line break, silently removing any hyphen present. (This will make it impossible to use our texts for hyphenation studies. So be it.)
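A minimal sketch of this normalisation, assuming the input is the plain text of a single paragraph with its original line breaks still present. The function name `unwrap_lines` is hypothetical. As the policy states, every line-final hyphen is removed, so a genuine compound that happens to break at its hyphen also loses it.

```python
import re

# A word character, a hyphen at the end of a line, then a word character
# at the start of the next line.
_BROKEN_WORD = re.compile(r"(\w)-[ \t]*\n[ \t]*(\w)")


def unwrap_lines(paragraph):
    """Rejoin words split across lines, then discard the remaining lineation."""
    # Step 1: reassemble hyphenated words, silently dropping the hyphen.
    joined = _BROKEN_WORD.sub(r"\1\2", paragraph)
    # Step 2: every remaining line break becomes a single space.
    return re.sub(r"\s*\n\s*", " ", joined).strip()


if __name__ == "__main__":
    sample = "a remark-\nable day in the\ncountry"
    print(unwrap_lines(sample))
```

Paragraph boundaries are assumed to be handled before this step, since the function collapses every line break it sees.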
Should page breaks in the source text be preserved?
The Prague decision (as noted above) was to suppress page numbering; however, the proposal is to retain it, since it will always be available for OCR texts, where it is essential information during text validation, and is usually available in other digital versions. The discussion in Prague concerned only its lack of utility during the analysis stage, but it is very useful during the transcription and validation stage.
Font and style variations in the source text usually signal something. Italics may
signal emphasis, quotation, foreign language terms etc. Superscripts almost always
signal abbreviation. The visual salience of these variations is of considerably less
interest to distant readers than the intended function they signal. However, it is
not always easy to determine that function reliably and consistently by algorithm.
Some simple cases could however be addressed. A possible strategy is outlined below.
It assumes the existence of a digital version of the text in which visual features
are explicit, whether by means of TEI-style markup or styling information such as
that provided by Word.
<abbr>14e</abbr>
The Prague decision (as noted above) was to suppress all encoding of renditional features.
Is it feasible or useful to recode highlighted spans of
text in this way?
Whichever solution is adopted, it should be applied uniformly across the ELTeC. A collection in which some texts make distinctions ignored by others is unsatisfactory.
This section will provide a checklist of TEI elements used in the body of each ELTeC text, with descriptions and examples of their intended applications.
This section describes the metadata associated with each text (title, authorship, date etc.) and with the collection as a whole. The intention is to provide this in a standardised way to facilitate subsetting of the collection, using (for example) coded values for the descriptive selection criteria associated with the text. As far as possible, our text should represent the first complete printed edition of each novel selected.
The TEI Header provides a very large number of possibilities for encoding such metadata. We will provide a checklist of the TEI Header elements which are always to be provided for each text, possibly in the form of a template. As in the body of the text, the intention is to provide a guaranteed minimal level of information, consistent across all parts of the ELTeC.
Note that metadata may be supplied at (at least) two levels: the level of the ELTeC as a whole, and that of individual texts within it. Information which applies uniformly to all parts of the collection should be supplied in the ELTeC header; information specific to a particular document in the text header.
Here is an example template for an individual text header
Within the
The
Taking these in turn, the
In addition to one or more
Here is an example:
The
The Incorporated into the ELTeC
The
The
The
The TEI
The
Should we invent our own taxonomy, use a pre-existing one, or make no attempt to constrain or predefine the terms used here?
The
Since our selection and descriptive criteria are likely to be specific to the
project, we will probably have to define them in the corpus header using the
following elements:
The
The TEI allows for the specification of encoding practice, by which is meant documentation of the specific editorial policies followed during transcription (treatment of printed hyphens, lexical normalisation, sampling procedures, features included, ignored, or normalised, etc.). Such specification may be supplied at the individual document level, or once for all across the whole of a corpus. It is even possible to specify that different parts of a document follow different policies, provided that all the available policies are defined somewhere.
We propose as far as possible not to allow for any variation in encoding policies applied within the ELTeC. We will still need to determine our encoding policies, of course, and to document them appropriately in the ELTeC corpus header, but there should be no need for separate specifications at the document level.
Additional markup facilities will be needed to represent more sophisticated annotations, which may be motivated linguistically (for example, to provide a normalised form, part of speech, etc.) or semantically (for example to distinguish proper names, names of people, places, events, etc.).
Sources consulted