NIST Scoring Toolkit Version 2.0beta

NIST Scoring Toolkit Version 2.0beta Welcome to the NIST Scoring Toolkit Version 2.0 Beta 1

The NIST Scoring Toolkit (SCTK) is a collection of software tools designed to score benchmark test evaluations of Automatic Speech Recognition (ASR) Systems. The toolkit is currently used by NIST, benchmark test participants, and reserchers worldwide to as a common scoring engine.

See the "Demo" section below to get a quick intro into key concepts>

This version of SCTK contains several programs:

sclite

SCTK has at its core, sclite (Score-Lite), which is a flexible Dynamic Programming alignment engine used to "align" errorful hypothesized texts, such as output from an ASR system, to the correct reference texts. After alignment, sclite generates a veriety of summary as well as detailed scoring reports.

This version of sclite comes bundled with the CMU-Cambridge Statistical Language Modeling Toolkit v2. The toolkit is used to compute word-weights based on an N-gram language model. The directory 'src/slm_v2' contains the complete distribution and is automatically compiled by the installation scripts.

sc_stats

While sclite aligns and scores a single system, sc_stats will compare system performance between more than one system, so long as the systems under test have been ran on identical test data and using an identical test paradigm. Inter-System comparisons are made by running tests paired-comparison statistical significance tests.

rover

Rover - Recognition Output Voting Error Reduction, is a tool which combines an arbitrary number for ASR system outputs into a composite Word Transition network which is then searched an scored to retrieve the best scoring word sequence.

The program is documented in the paper A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER) presented at the 1997 IEEE Workshop on Automatic Speech Recognition and Understanding. The paper is also available in postscript.

hubscr

Wrapper script for scoring DARPA STT evaluations rfilter1

Transcription normalization filter engine csrfilt

Transcription normalization wrapper script chfilt

CallHome/Switchboard style transcript filter hamzaNorm

Arabic Hamza+alif transcript normalization filter acomp

Automatic compound word expansion filter def_art

Aribic definate article striping program utf_filt

UTF file format transcription filter

ASR Scoring Demo

So. what is this package for? It's for computing the accuracy of ASR engines that convert recordings of speech into text. We'll use data in ../src/sclite/testdata for the demo. The process to compute the accuracy is:

Record a sample of speech (called an utterance) storing it in a waveform file like a .wav or .mp3. Each recording has a unique utterance id used later.
Manually transcribe the speech to build what we call the 'reference' transcription - some call it the ground truth but the term reference is preferred because it is costly/difficult/impossible to make 100% accurate transctript.
Convert the audio into the 'hypothesized' text with a program like Kaldi . The text is hypothesized to be correct by the system.
Assemble the reference and hypothsis texts into separate files for scoring. For example ../src/sclite/testdata/demo.ref.txt and ../src/sclite/testdata/demo.hyp.txt are example reference and hypothesis transcripts for 2 utterances. Since there a two utterances, the transcript for each is labelled with the utterance id in parens.

Run the scorer with these commands after sclite has been compiled:

cd src/sclite/testdata
../sclite -r demo.ref.txt -h demo.hyp.txt -i spu_id -o sum pra stdout




                     SYSTEM SUMMARY PERCENTAGES by SPEAKER                      

      ,------------------------------------------------------------------.
      |                           demo.hyp.txt                           |
      |------------------------------------------------------------------|
      | SPKR     | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
      |----------+-------------+-----------------------------------------|
      | speaker1 |    2     46 | 84.8   15.2    0.0    2.2   17.4   50.0 |
      |==================================================================|
      | Sum/Avg  |    2     46 | 84.8   15.2    0.0    2.2   17.4   50.0 |
      |==================================================================|
      |   Mean   |  2.0   46.0 | 84.8   15.2    0.0    2.2   17.4   50.0 |
      |   S.D.   |  0.0    0.0 |  0.0    0.0    0.0    0.0    0.0    0.0 |
      |  Median  |  2.0   46.0 | 84.8   15.2    0.0    2.2   17.4   50.0 |
      `------------------------------------------------------------------'


		DUMP OF SYSTEM ALIGNMENT STRUCTURE

System name:   demo.hyp.txt

Speakers: 
    0:  speaker1

Speaker sentences   0:  speaker1   #utts: 2
id: (speaker1-utterance1)
Scores: (#C #S #D #I) 25 0 0 0
REF:  as competition in the mutual fund business grows increasingly intense more players in the industry appear willing to sacrifice integrity in the name of performance 
HYP:  as competition in the mutual fund business grows increasingly intense more players in the industry appear willing to sacrifice integrity in the name of performance 
Eval:                                                                                                                                                                     

id: (speaker1-utterance2)
Scores: (#C #S #D #I) 14 7 0 1
REF:  FOR   A  TWO    TRILLION DOLLAR business built on public confidence this trend is **** DISHEARTENING AT  best and downright dangerous at worst 
HYP:  FREED TO TRYING TO       LURE   business built on public confidence this trend is THIS TIGHTENING    AND best and downright dangerous at worst 
Eval: S     S  S      S        S                                                        I    S             S