options
The inputs to "sclite" are a
reference file and one or more hypothesis files, the text portions of which
may be either ASCII characters or GB-encoded Chinese characters.
There are a number of different input formats permitted:
"trn",
"txt",
"stm", and
"ctm".
As new scoring paradigms were created for the ARPA
tests, accompanying formats were created to support the evaluations.
trn - Definition of a transcript input file
The transcript format is a file of word sequence
records separated by newlines. Each record contains a
word sequence, followed by an utterance ID enclosed
in parentheses. See the '-i' option for a list of
accepted utterance ID types.
example.
she had your dark suit in greasy wash water all year (cmh_sa01)
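As an illustration of the record layout described above, the following Python sketch splits one trn record into its words and its utterance ID. The function name parse_trn_record is hypothetical and is not part of sclite. The sketch assumes the utterance ID contains no whitespace or parentheses; the ID types that sclite actually accepts are governed by the '-i' option.

```python
import re

# Trailing "(utterance_id)" at the end of a record. Assumption: the ID
# itself contains no whitespace and no parentheses.
_TRN_RE = re.compile(r"^(?P<words>.*?)\s*\((?P<utt_id>[^()\s]+)\)\s*$")


def parse_trn_record(line):
    """Split one trn record into (list_of_words, utterance_id)."""
    match = _TRN_RE.match(line)
    if match is None:
        raise ValueError(f"no trailing utterance ID in record: {line!r}")
    return match.group("words").split(), match.group("utt_id")


words, utt_id = parse_trn_record(
    "she had your dark suit in greasy wash water all year (cmh_sa01)"
)
```

Only the last parenthesized token is taken as the utterance ID, so a parenthesized word earlier in the record stays in the word list.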
Transcript alternations can be used
in the word sequence by using this recursive BNF format:
ALTERNATE :== "{" TEXT ALT+ "}"
ALT :== "/" TEXT
TEXT :== 1 or more whitespace separated words |
"@" | ALTERNATE
The "@" represents a NULL word in the transcript. For
scoring purposes, an error is not counted if the "@" is
aligned as an insertion.
example
i've { um / uh / @ } as far as i'm concerned (cmh_sa02)
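To make the recursive grammar concrete, the Python sketch below lists every word sequence that an alternation permits. It is an illustration only, not sclite's parser. The name expand_alternations is hypothetical, and the sketch assumes that "{", "/", "}" and "@" are separated from neighbouring words by whitespace, as in the example above. It does not check that each alternation has at least two alternatives.

```python
def expand_alternations(text):
    """Return every word sequence that the alternation markup allows."""
    tokens = text.split()
    pos = 0

    def parse_sequence(stoppers):
        # Parse words and nested alternations until a stopper token is seen.
        nonlocal pos
        results = [[]]
        while pos < len(tokens) and tokens[pos] not in stoppers:
            tok = tokens[pos]
            pos += 1
            if tok == "{":
                choices = parse_alternate()
            elif tok == "@":
                choices = [[]]  # NULL word: contributes nothing
            else:
                choices = [[tok]]
            results = [r + c for r in results for c in choices]
        return results

    def parse_alternate():
        # Called just after "{"; collects alternatives up to the matching "}".
        nonlocal pos
        choices = parse_sequence({"/", "}"})
        while pos < len(tokens) and tokens[pos] == "/":
            pos += 1
            choices += parse_sequence({"/", "}"})
        if pos >= len(tokens) or tokens[pos] != "}":
            raise ValueError("unterminated alternation")
        pos += 1
        return choices

    return [" ".join(words) for words in parse_sequence(set())]


variants = expand_alternations("i've { um / uh / @ } as far as i'm concerned")
```

For the example record, the three variants contain "um", "uh", and no filler word respectively.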
Words can be marked as optional by surrounding a given word with parentheses. In the example below, the word /farmer/ is optional. Note that the parser distinguishes parentheses applied to text from those enclosing the utterance ID.
I am a (farmer) (cmh_sa03)
txt - Definition of a text input file
This format is simply free-form text with no page,
paragraph, sentence, or speaker breaks.
stm - Definition of segment time mark input file
This describes the segment time marked files to be used for
scoring the output of speech recognizers via the NIST
sclite() program. This is a reference file format.
The segment time mark file consists of a concatenation of
text segment records from a waveform file. Each record is
separated by a newline and contains: the waveform's filename
and channel identifier [A | B], the talker's id, begin and
end times (in seconds), an optional subset label and the text
for the segment. Each record follows this BNF format:
STM :== <F> <C> <S> <BT> <ET> [ <LABEL> ] transcript . . .
Where :
<F>
The waveform filename. NOTE: no pathnames or
extensions are expected.
<C>
The waveform channel. The text of the waveform channel
is not restricted by sclite. The text can be any text string without
whitespace so long as the matching string is found in both the reference
and hypothesis input files.
<S>
The speaker id. No restrictions apply to this
name.
<BT>
The begin time (seconds) of the segment.
<ET>
The end time (seconds) of the segment.
<LABEL>
A comma separated list of subset identifiers
enclosed in angle brackets. Ex. "<O,F,00>". See
"USING STM FORMAT FOR LABELED UTTERANCE REPORTS
(LUR)" below.
transcript
The transcript can take on two forms: 1) a whitespace separated list of
words, or 2) the string "IGNORE_TIME_SEGMENT_IN_SCORING".
The list of
words can contain a transcript alternation using the following
BNF format:
ALTERNATE :== "{" TEXT ALT+ "}"
ALT :== "/" TEXT
TEXT :== 1 thru n words | "@" | ALTERNATE
The "@" represents a NULL word in the transcript. For scoring
purposes, an error is not counted if the "@" is aligned as an
insertion.
Example: "i've { um / uh / @ } as far as i'm concerned"
Words can be marked as optional by surrounding a given word with parentheses. For example, in "I am a (farmer)" the word /farmer/ is optional.
When the string "IGNORE_TIME_SEGMENT_IN_SCORING" is used as the transcript,
the process which chops the hypothesis file into matching reference segments
ignores all hypothesis words whose time-midpoints occur within the reference
segment's beginning
and ending times. The effect is to declare the segment's time region
"out-of-bounds" for scoring, thus generating no errors from that time
region.
NOTE: this only works with DP alignment of a reference stm file
and hypothesis ctm file.
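The midpoint rule can be sketched in Python as follows. This illustrates the rule as stated above and is not sclite's code. The name drop_ignored_words is hypothetical, all words and segments are assumed to belong to the same waveform file and channel, and treating the segment boundaries as inclusive is an assumption, since the text does not say how a midpoint exactly on a boundary is handled.

```python
def drop_ignored_words(hyp_words, ignored_segments):
    """Remove hypothesis words whose time-midpoint lies in an ignored segment.

    hyp_words:        iterable of (begin_time, duration, word) tuples
    ignored_segments: list of (begin_time, end_time) tuples taken from
                      reference segments whose transcript is
                      IGNORE_TIME_SEGMENT_IN_SCORING
    """
    kept = []
    for begin, duration, word in hyp_words:
        midpoint = begin + duration / 2.0
        # Assumption: segment boundaries are inclusive.
        in_ignored = any(
            seg_begin <= midpoint <= seg_end
            for seg_begin, seg_end in ignored_segments
        )
        if not in_ignored:
            kept.append((begin, duration, word))
    return kept


hyp = [(11.34, 0.2, "YES"), (12.00, 0.34, "YOU"), (13.30, 0.5, "CAN")]
scored = drop_ignored_words(hyp, [(12.0, 13.0)])
```

Here "YOU" has its midpoint at 12.17 seconds, inside the ignored region, so only "YES" and "CAN" remain for scoring.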
Example STM file:
;; comment
2345 A 2345-a 0.10 2.03 uh huh yes i thought
2345 A 2345-b 2.10 3.04 dog walking is a very
2345 A 2345-a 3.50 4.59 yes but it's worth it
The file must be sorted by the first and second columns in
ASCII order, and the fourth in numeric order. The UNIX sort
command: "sort +0 -1 +1 -2 +3nb -4" will sort the records
into the appropriate order.
Lines beginning with ';;' are considered comments and are
ignored. Blank lines are also ignored.
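A reader for the STM record layout above might look like the following Python sketch. The names StmSegment and parse_stm_line are hypothetical. The sketch assumes that a sixth field of the form "<...>" is always the subset label, so a transcript whose first word is enclosed in angle brackets would be misread.

```python
import re
from typing import NamedTuple, Optional, Tuple

IGNORE = "IGNORE_TIME_SEGMENT_IN_SCORING"
_LABEL_RE = re.compile(r"^<([^<>\s]*)>$")


class StmSegment(NamedTuple):
    filename: str
    channel: str
    speaker: str
    begin: float
    end: float
    labels: Tuple[str, ...]  # empty when no label field is present
    transcript: str


def parse_stm_line(line) -> Optional[StmSegment]:
    """Parse one STM line; return None for comments and blank lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith(";;"):
        return None
    fields = stripped.split()
    if len(fields) < 5:
        raise ValueError(f"too few fields in STM record: {line!r}")
    filename, channel, speaker, begin, end = fields[:5]
    rest = fields[5:]
    labels = ()
    if rest:
        match = _LABEL_RE.match(rest[0])
        if match:  # optional <LABEL> field
            labels = tuple(match.group(1).split(","))
            rest = rest[1:]
    return StmSegment(filename, channel, speaker,
                      float(begin), float(end), labels, " ".join(rest))


segment = parse_stm_line("2345 A 2345-a 0.10 2.03 uh huh yes i thought")
```

A caller can compare segment.transcript against IGNORE to find the regions excluded from scoring.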
USING STM FORMAT FOR LABELED UTTERANCE REPORTS (LUR):
Motivation:
For the Fall '95 ARPA CSR Evaluation, it was desirable
to not only report overall error-rate statistics but
also error-rate statistics for arbitrary partitions
and/or groups of partitions within the test set. To
this end, the STM file format was extended to encode
arbitrary subset information for each segment.
Usage:
The subset information is encoded by adding two types
of information to the STM file. The first information
type is a special comment line, the subset information line (SIL).
The SIL defines the subset's label
id, a short column heading and a description. The special
comment line format is:
;; LABEL "<ID>" "<COL_HD>" "<DESC>"
where:
<ID>
The subset id. Used to label each segment
that belongs to the subset. The format is
arbitrary, but without spaces.
<COL_HD>
Used as column headings in generated reports.
Format is arbitrary.
<DESC>
Used for subset descriptions in generated
reports. May be of arbitrary length and format.
Double backslashes '\\' add a line feed.
The order of the SIL lines in the STM file defines the
order of subset presentation in the generated reports.
The second type of information incorporated into the
STM file is an optional sixth field to the text segment
record. The field consists of a comma separated list
of subset ids enclosed in angle brackets. Each unique
id must have a special comment line, specified above,
to be properly interpreted. Otherwise the id will be
ignored.
Each position within the label field, separated by
commas, defines a group of subsets that are presented
separately in the generated reports. So for instance,
the first group might be all segments, and the second
might be either male or female, and the third might be
the story. The example below shows an STM file encoded
with this information.
;; LABEL "M" "Male" "Male Talkers"
;; LABEL "F" "Female" "Female Talkers"
;; LABEL "01" "Story 1" "Business news"
;; LABEL "00" "Not in Story" "Words or Phrases not contained in a story"
940328 1 A 4.00 18.10 <O,F,00> FROM LOS ANGELES
940328 1 B 18.10 25.55 <O,M,01> MEXICO IN TURMOIL
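The two kinds of subset information can be read with a short Python sketch such as the one below. The names read_subset_labels and group_labels are hypothetical, and the sketch assumes each SIL sits on a single line with its three fields in double quotes. Ids without a SIL are dropped, following the rule stated above.

```python
import re

# ;; LABEL "<ID>" "<COL_HD>" "<DESC>"
_SIL_RE = re.compile(r'^;;\s*LABEL\s+"([^"]*)"\s+"([^"]*)"\s+"([^"]*)"\s*$')


def read_subset_labels(lines):
    """Return {id: (column_heading, description)} in file order."""
    subsets = {}
    for line in lines:
        match = _SIL_RE.match(line.strip())
        if match:
            subset_id, col_hd, desc = match.groups()
            # A double backslash in the description stands for a line feed.
            subsets[subset_id] = (col_hd, desc.replace("\\\\", "\n"))
    return subsets


def group_labels(label_field, subsets):
    """Split "<O,F,00>" into (position, id) pairs, dropping undefined ids."""
    ids = label_field.strip("<>").split(",")
    return [(pos, sid) for pos, sid in enumerate(ids) if sid in subsets]


sil_lines = [
    ';; LABEL "M" "Male" "Male Talkers"',
    ';; LABEL "F" "Female" "Female Talkers"',
    ';; LABEL "01" "Story 1" "Business news"',
    ';; LABEL "00" "Not in Story" "Words or Phrases not contained in a story"',
]
subsets = read_subset_labels(sil_lines)
groups = group_labels("<O,F,00>", subsets)
```

With only these four SIL lines, the id "O" has no definition, so the sketch keeps just the second and third positions of the label field.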
ctm - Definition of time marked conversation scoring input
This describes the time marked conversation input files to
be used for scoring the output of speech recognizers via the
NIST sclite() program. Both the reference and hypothesis
input files can share this format.
The ctm file format is a concatenation of time mark records
for each word in each channel of a waveform. The records
are separated with a newline. Each word token must have a
waveform id, channel identifier [A | B], start time, duration,
and word text. Optionally a confidence score can be
appended for each word. Each record follows this BNF format:
CTM :== <F> <C> <BT> <DUR> <WORD> [ <CONF> ]
Where :
<F> ->
The waveform filename. NOTE: no pathnames or
extensions are expected.
<C> ->
The waveform channel, conventionally "A" or "B". The text of the waveform channel
is not restricted by sclite: it can be any text string without
whitespace so long as the matching string is found in both the reference
and hypothesis input files.
<BT> ->
The begin time (seconds) of the word, measured
from the start time of the file.
<DUR> ->
The duration (seconds) of the word.
<WORD> ->
The text of the word. This could be a normal word, a fragment /she-/, an optional word /(she)/, or an optional fragment /(she-)/.
<CONF> ->
Optional confidence score. It is proposed that
this score will be used in the future.
The file must be sorted by the first three columns: the
first and the second in ASCII order, and the third in
numeric order. The UNIX sort command: "sort +0 -1 +1 -2
+2nb -3" will sort the words into the appropriate order.
Lines beginning with ';;' are considered comments and are
ignored. Blank lines are also ignored.
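The ordering rule can also be expressed as a sort key, as in the Python sketch below. The names ctm_sort_key and sort_ctm_records are hypothetical. The sketch drops comments and blank lines, and it does not handle alternation tag records, whose begin time is "*" rather than a number.

```python
def ctm_sort_key(record_line):
    """Key: filename and channel as strings, begin time as a number."""
    fields = record_line.split()
    return (fields[0], fields[1], float(fields[2]))


def sort_ctm_records(lines):
    """Return the word records in the order described above."""
    records = [
        ln for ln in lines
        if ln.strip() and not ln.lstrip().startswith(";;")
    ]
    return sorted(records, key=ctm_sort_key)


unsorted = [
    "7654 B 1.34 0.2 I",
    "7654 A 12.00 0.34 YOU",
    ";; a comment",
    "7654 A 9.50 0.2 YES",
]
ordered = sort_ctm_records(unsorted)
```

Converting the third column with float() is what makes 9.50 sort before 12.00. A plain string comparison would place "12.00" first.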
Included below is an example:
;;
;; Comments follow ';;'
;;
;; The Blank lines are ignored
;;
7654 A 11.34 0.2 YES -6.763
7654 A 12.00 0.34 YOU -12.384530
7654 A 13.30 0.5 CAN 2.806418
7654 A 17.50 0.2 AS 0.537922
:
7654 B 1.34 0.2 I -6.763
7654 B 2.00 0.34 CAN -12.384530
7654 B 3.40 0.5 ADD 2.806418
7654 B 7.00 0.2 AS 0.537922
:
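A minimal Python reader for word records of this form is sketched below. The names CtmWord and parse_ctm_line are hypothetical. The sketch covers plain word records only: alternation tag records, which carry "*" in the time fields, and the ":" continuation markers in the example above would raise an error.

```python
from typing import NamedTuple, Optional


class CtmWord(NamedTuple):
    filename: str
    channel: str
    begin: float
    duration: float
    word: str
    confidence: Optional[float]  # None when no <CONF> field is given


def parse_ctm_line(line) -> Optional[CtmWord]:
    """Parse one CTM line; return None for comments and blank lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith(";;"):
        return None
    fields = stripped.split()
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 fields in CTM record: {line!r}")
    confidence = float(fields[5]) if len(fields) == 6 else None
    return CtmWord(fields[0], fields[1], float(fields[2]),
                   float(fields[3]), fields[4], confidence)


first = parse_ctm_line("7654 A 11.34 0.2 YES -6.763")
```

The end time of a word, when needed, is first.begin + first.duration, since the format stores a duration rather than an end time.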
For CTM reference files, a format extension exists to permit
marking alternate transcripts. The alternation uses the
same file format as described above, except three word
strings, "<ALT_BEGIN>", "<ALT>" and "<ALT_END>", are used to
delimit the alternation. Each tag is treated as a word,
with a conversation id, channel and "*" for the begin
time and duration. The alternation tags are non-recursive and may not be embedded.
(The STM format does support recursive alternation.)
The alternation is begun using the word "<ALT_BEGIN>", and
terminated using the word "<ALT_END>". Between the start
and end are at least 2 alternative time-marked word
sequences separated by the word "<ALT>". Each word sequence
can contain any number of words. An empty alternative
signifies a null word.
Below is an example alternate reference transcript for the
words "uh" and "um".
;;
7654 A * * <ALT_BEGIN>
7654 A 12.00 0.34 UM
7654 A * * <ALT>
7654 A 12.00 0.34 UH
7654 A * * <ALT_END>
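The tag structure can be checked with a small Python sketch like the one below, which groups the words of one alternation into its alternatives. The name split_alternatives is hypothetical. The sketch assumes its input starts at an "<ALT_BEGIN>" record and reads up to the matching "<ALT_END>", and it keeps only the word field of each record.

```python
def split_alternatives(lines):
    """Group the word fields of one <ALT_BEGIN> ... <ALT_END> block."""
    alternatives = None
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(";;"):
            continue  # comments and blank lines are ignored
        word = stripped.split()[4]  # fifth column is the word text
        if word == "<ALT_BEGIN>":
            if alternatives is not None:
                raise ValueError("alternation tags may not be nested")
            alternatives = [[]]
        elif alternatives is None:
            raise ValueError("expected <ALT_BEGIN> first")
        elif word == "<ALT>":
            alternatives.append([])
        elif word == "<ALT_END>":
            if len(alternatives) < 2:
                raise ValueError("need at least 2 alternatives")
            return alternatives
        else:
            alternatives[-1].append(word)
    raise ValueError("missing <ALT_END>")


block = [
    ";;",
    "7654 A * * <ALT_BEGIN>",
    "7654 A 12.00 0.34 UM",
    "7654 A * * <ALT>",
    "7654 A 12.00 0.34 UH",
    "7654 A * * <ALT_END>",
]
alts = split_alternatives(block)
```

An empty inner list in the result corresponds to an empty alternative, that is, a null word.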