---
name: whisper-test
description: |
  Transcribe WAV audio files using OpenAI Whisper for intelligibility testing.
  Triggers on: "transcribe audio", "whisper test", "test audio output",
  "is the audio intelligible", "check speech quality", "run whisper",
  "speech to text test", "check if audio sounds right"
---

# Whisper Audio Intelligibility Test

Transcribe WAV audio files using OpenAI Whisper and report whether the speech is intelligible. Optionally compare against expected text.

## Setup

Whisper is installed as a uv tool: `uv tool install openai-whisper`. Since this machine may lack `ffmpeg`, always use the Python API approach that loads WAV files with scipy (bypasses the ffmpeg requirement).

## Running Transcription

Use `uv run --no-project --with openai-whisper --with scipy --python 3.11` to execute the transcription script:

```bash
uv run --no-project --with openai-whisper --with scipy --python 3.11 \
  python3 ~/.claude/skills/whisper-test/transcribe.py \
  [--model tiny|base|small|medium|large-v3] \
  [--language en] \
  [--expected "expected text"] \
  [--json] \
  file1.wav [file2.wav ...]
```

### Arguments

- `--model`: Whisper model size (default: `large-v3`). See the model selection guide below.
- `--language`: Language hint (default: `en`).
- `--expected`: Expected transcription text. When provided, calculates Word Error Rate (WER).
- `--json`: Output results as JSON instead of human-readable text.
- Positional: One or more WAV file paths.

### Model Selection

Use `large-v3` for TTS quality verification. Smaller models hallucinate or miss words in synthesized speech, making them unreliable for judging output quality.

| Model      | VRAM   | When to use                                                      |
| ---------- | ------ | ---------------------------------------------------------------- |
| `large-v3` | ~10 GB | **Default.** TTS evaluation, quality gating, regression testing  |
| `medium`   | ~5 GB  | GPU memory constrained, still decent accuracy                    |
| `small`    | ~2 GB  | Quick smoke tests only                                           |
| `base`     | ~1 GB  | Not recommended for TTS — high hallucination rate                |
| `tiny`     | ~1 GB  | Not recommended for TTS — unreliable                             |

Observed with identical Qwen3-TTS 1.7B voice-cloned output:

- `large-v3`: "That's one tank. Flash attention pipeline." (key phrase captured)
- `base`: "That's one thing, flash attention pipeline." (close but hallucinated)

For poor-quality 0.6B output, `base` hallucinated "Charging Wheel" while `large-v3` gave "Flat, splashes." — honest about the poor quality instead of confabulating plausible words.

### Output Format

For each file, prints:

```text
filename.wav:
  transcription: "Hello world, this is a test."
  duration: 2.96s
  rms: 0.0866
  peak: 0.6832
  silence: 49.2%
  [wer: 0.0%]          (if --expected provided)
```

## Interpreting Results

| Transcription         | Meaning                                                                          |
| --------------------- | -------------------------------------------------------------------------------- |
| Matches expected text | Audio is intelligible and correct                                                |
| Partial match         | Audio has some speech but quality issues                                         |
| Empty string `""`     | Audio is unintelligible (noise, silence, or garbage)                             |
| Hallucinated text     | Model heard something in noise (common with Whisper, especially smaller models)  |

### Audio Quality Indicators

- **RMS < 0.01**: Essentially silent
- **silence > 80%**: Mostly silence, likely no speech
- **peak < 0.05**: Very quiet, may not contain useful audio
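These indicators come straight from the raw waveform. A minimal sketch of how such statistics can be computed with scipy is shown below; the exact silence threshold and any windowing `transcribe.py` applies are assumptions, so treat the numbers as illustrative rather than byte-identical to the script's output.

```python
import numpy as np
from scipy.io import wavfile

def audio_stats(path: str, silence_threshold: float = 0.01) -> dict:
    """Rough per-file quality metrics: duration, RMS, peak, and silence fraction."""
    rate, data = wavfile.read(path)
    if data.dtype == np.int16:                 # 16-bit PCM is the usual TTS output
        data = data.astype(np.float32) / 32768.0
    else:                                      # assume float data, close enough for a sanity check
        data = data.astype(np.float32)
    if data.ndim > 1:                          # downmix stereo to mono
        data = data.mean(axis=1)
    return {
        "duration_s": len(data) / rate,
        "rms": float(np.sqrt(np.mean(data ** 2))),
        "peak": float(np.max(np.abs(data))),
        "silence_pct": 100.0 * float(np.mean(np.abs(data) < silence_threshold)),
    }
```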
### TTS-Specific Patterns

Voice-cloned TTS output often has these characteristics:

- **Garbled opening, clear ending**: Common with ICL voice cloning on short references. The model needs a few frames to "lock in" to the target voice.
- **Key phrases preserved**: Even when WER is high, domain-specific terms (e.g. "flash attention pipeline") often come through clearly.
- **Smaller models produce worse audio**: 0.6B models produce significantly less intelligible output than 1.7B — expect Whisper to reflect this.

## Batch Testing (TTS Variant Comparison)

When testing multiple TTS outputs against expected text:

```bash
uv run --no-project --with openai-whisper --with scipy --python 3.11 \
  python3 ~/.claude/skills/whisper-test/transcribe.py \
  --expected "Hello world, this is a test." \
  variant1.wav variant2.wav variant3.wav
```

This produces a comparison table showing which variants produce intelligible speech.

## Docker / NGC Container Usage

When testing on a GPU box inside an NGC container (e.g. for CUDA flash-attn builds), ffmpeg isn't available and apt can be slow. Two workarounds:

1. **Static ffmpeg binary** (fast, no apt):

   ```bash
   curl -sL https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-arm64-static.tar.xz \
     | tar xJ --strip-components=1 -C /usr/local/bin/ --wildcards "*/ffmpeg" "*/ffprobe"
   pip install openai-whisper
   ```

2. **Use scipy loader** (this script's default — no ffmpeg needed):

   ```bash
   pip install openai-whisper scipy
   python3 ~/.claude/skills/whisper-test/transcribe.py --model large-v3 output.wav
   ```

The script loads WAV files directly via scipy, bypassing Whisper's ffmpeg dependency entirely. This works for WAV files (the standard TTS output format).
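For reference, the scipy-based bypass amounts to loading the WAV yourself and handing Whisper a numpy array, which sidesteps `whisper.load_audio()` (the call that shells out to ffmpeg). This is a minimal sketch, not the actual `transcribe.py`; the resampling step, the 16-bit PCM assumption, and `fp16=False` are assumptions about how the audio is prepared.

```python
import numpy as np
import whisper
from scipy.io import wavfile
from scipy.signal import resample

def transcribe_wav(path: str, model_name: str = "large-v3", language: str = "en") -> str:
    rate, data = wavfile.read(path)            # no ffmpeg involved
    if data.dtype == np.int16:                 # normalize 16-bit PCM to [-1, 1]
        data = data.astype(np.float32) / 32768.0
    else:
        data = data.astype(np.float32)
    if data.ndim > 1:                          # downmix stereo to mono
        data = data.mean(axis=1)
    if rate != 16000:                          # Whisper expects 16 kHz mono float32
        data = resample(data, int(len(data) * 16000 / rate)).astype(np.float32)
    model = whisper.load_model(model_name)
    # Passing a numpy array skips whisper.load_audio(), so ffmpeg is never invoked.
    result = model.transcribe(data, language=language, fp16=False)
    return result["text"].strip()

print(transcribe_wav("output.wav"))
```

The same approach works inside the NGC container with nothing more than `pip install openai-whisper scipy`.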
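The WER reported with `--expected` is conventionally the word-level edit distance divided by the number of reference words; a minimal version is sketched below (the exact text normalization `transcribe.py` applies, e.g. punctuation stripping, is an assumption).

```python
def word_error_rate(expected: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance over the reference word count."""
    ref, hyp = expected.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                            # delete every remaining reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j                            # insert every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```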