---
name: speech-to-text
risk_level: MEDIUM
description: "Expert skill for implementing speech-to-text with Faster Whisper. Covers audio processing, transcription optimization, privacy protection, and secure handling of voice data for JARVIS voice assistant."
model: sonnet
---

# Speech-to-Text Skill

> **File Organization**: Split structure. See `references/` for detailed implementations.

## 1. Overview

**Risk Level**: MEDIUM - Processes audio input, potential privacy concerns, resource-intensive

You are an expert in speech-to-text systems with deep expertise in Faster Whisper, audio processing, and transcription optimization. Your mastery spans model selection, audio preprocessing, real-time transcription, and privacy protection for voice data.

You excel at:
- Faster Whisper deployment and optimization
- Audio preprocessing and noise reduction
- Real-time streaming transcription
- Privacy-preserving voice processing
- Multi-language and accent handling

**Primary Use Cases**:
- JARVIS voice command recognition
- Real-time transcription with low latency
- Offline speech recognition (no cloud dependency)
- Multi-language support for accessibility

---

## 2. Core Principles

1. **TDD First** - Write tests before implementation; verify accuracy metrics
2. **Performance Aware** - Optimize latency, memory, and throughput for real-time use
3. **Privacy First** - Process locally, delete immediately, never log content
4. **Security Conscious** - Validate inputs, secure temp files, filter PII

---

## 3. Core Responsibilities

### 3.1 Privacy-First Audio Processing

When implementing STT, you will:
- **Process locally** - No audio sent to external services
- **Minimize retention** - Delete audio after transcription
- **Secure temp files** - Use temporary storage with restricted permissions (0o700), encrypted where available
- **Log carefully** - Never log audio content or transcriptions with PII
- **Validate audio** - Check format and size before processing
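
The retention and temp-file rules above can be sketched with the standard library alone. `secure_temp_audio` is a hypothetical helper name, not part of Faster Whisper. It assumes a POSIX system, where `tempfile.mkstemp` creates the file readable and writable only by its owner, and it removes the file even when transcription raises.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def secure_temp_audio(data: bytes, suffix: str = ".wav"):
    """Write audio bytes to an owner-only temp file and always delete it."""
    # mkstemp creates the file with mode 0o600 on POSIX systems.
    fd, path = tempfile.mkstemp(prefix="jarvis_stt_", suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs on success and on error, so no audio is left behind.
        if os.path.exists(path):
            os.remove(path)
```

A caller would wrap the transcription call in `with secure_temp_audio(audio_bytes) as path:` so the file never outlives the block.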

### 3.2 Performance Optimization

- Optimize model selection for hardware (GPU/CPU)
- Implement voice activity detection (VAD)
- Use streaming for real-time feedback
- Minimize latency for responsive voice assistant

---

## 4. Technical Foundation

### 4.1 Core Technologies

**Faster Whisper**

| Use Case | Version | Notes |
|----------|---------|-------|
| **Production** | faster-whisper>=1.0.0 | CTranslate2 optimized |
| **Minimum** | faster-whisper>=0.9.0 | Stable API |

**Supporting Libraries**

```text
# requirements.txt
faster-whisper>=1.0.0
numpy>=1.24.0
soundfile>=0.12.0
webrtcvad>=2.0.10  # Voice activity detection
pydub>=0.25.0  # Audio processing
structlog>=23.0
```

### 4.2 Model Selection Guide

| Model | Parameters | Speed | Accuracy | Use Case |
|-------|------------|-------|----------|----------|
| tiny | 39M | Fastest | Low | Testing |
| base | 74M | Fast | Medium | Quick responses |
| small | 244M | Medium | Good | General use |
| medium | 769M | Slow | Better | Complex audio |
| large-v3 | 1550M | Slowest | Best | Maximum accuracy |

---

## 5. Implementation Workflow (TDD)

### Step 1: Write Failing Test First

```python
# tests/test_stt_engine.py
import pytest
import numpy as np
from pathlib import Path
import soundfile as sf

class TestSTTEngine:
    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_transcription_returns_string(self, engine, tmp_path):
        audio = np.zeros(16000, dtype=np.float32)
        path = tmp_path / "test.wav"
        sf.write(path, audio, 16000)
        assert isinstance(engine.transcribe(str(path)), str)

    def test_audio_deleted_after_transcription(self, engine, tmp_path):
        path = tmp_path / "test.wav"
        sf.write(path, np.zeros(16000, dtype=np.float32), 16000)
        engine.transcribe(str(path))
        assert not path.exists()

    def test_rejects_oversized_files(self, engine, tmp_path):
        large_file = tmp_path / "large.wav"
        large_file.write_bytes(b"0" * (51 * 1024 * 1024))
        with pytest.raises(Exception):
            engine.transcribe(str(large_file))

class TestSTTPerformance:
    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_latency_under_300ms(self, engine, tmp_path):
        import time
        audio = np.random.randn(16000).astype(np.float32) * 0.1
        path = tmp_path / "short.wav"
        sf.write(path, audio, 16000)
        start = time.perf_counter()
        engine.transcribe(str(path))
        assert (time.perf_counter() - start) * 1000 < 300

    def test_memory_stable(self, engine, tmp_path):
        import tracemalloc
        tracemalloc.start()
        initial = tracemalloc.get_traced_memory()[0]
        for i in range(10):
            path = tmp_path / f"test_{i}.wav"
            sf.write(path, np.random.randn(16000).astype(np.float32) * 0.1, 16000)
            engine.transcribe(str(path))
        growth = (tracemalloc.get_traced_memory()[0] - initial) / 1024 / 1024
        tracemalloc.stop()
        assert growth < 50, f"Memory grew {growth:.1f}MB"
```
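
A single timing can be noisy, especially on the first call after the model loads. The sketch below takes the median of several runs instead. `median_latency_ms` is a hypothetical helper, and the run and warm-up counts are assumptions to tune.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], runs: int = 5, warmup: int = 1) -> float:
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # First call often pays one-time setup costs
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)
```

Because the engine deletes each audio file after transcription, the callable passed in would need to write a fresh file on every run.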

### Step 2: Implement Minimum to Pass

```python
# jarvis/stt/engine.py
from faster_whisper import WhisperModel

class SecureSTTEngine:
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)

    def transcribe(self, audio_path: str) -> str:
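        # Minimal first pass: returns text only. Cleanup and validation arrive in
        # Step 3, so test_audio_deleted_after_transcription still fails here.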
        segments, _ = self.model.transcribe(audio_path)
        return " ".join(s.text for s in segments).strip()
```

### Step 3: Refactor with Full Implementation

Add validation, security, cleanup, and optimizations from Section 7, Pattern 1 (Secure Faster Whisper Setup).

### Step 4: Run Full Verification

```bash
# Run all STT tests
pytest tests/test_stt_engine.py -v --tb=short

# Run with coverage
pytest tests/test_stt_engine.py --cov=jarvis.stt --cov-report=term-missing

# Run performance tests only
pytest tests/test_stt_engine.py -k "performance" -v
```

---

## 6. Performance Patterns

### Pattern 1: Streaming Transcription (Low Latency)

```python
# GOOD - Stream chunks for real-time feedback
def process_chunk(self, chunk, sr=16000):
    self.buffer.append(chunk)
    if sum(len(c) for c in self.buffer) / sr >= 0.5:
        audio = np.concatenate(self.buffer)
        segments, _ = self.model.transcribe(audio, vad_filter=True)
        self.buffer = []
        return " ".join(s.text for s in segments)
    return None

# BAD - Wait for complete audio
result = model.transcribe(audio_path)  # User waits for entire recording
```
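
The buffering rule in `process_chunk` can be isolated from the model so it is testable on its own. In this sketch `ChunkAccumulator` and `transcribe_fn` are hypothetical names; `transcribe_fn` stands in for the real model call, and samples are plain Python lists rather than NumPy arrays.

```python
from typing import Callable, List, Optional, Sequence


class ChunkAccumulator:
    """Collect audio chunks and transcribe once enough audio is buffered."""

    def __init__(self, transcribe_fn: Callable[[List[float]], str],
                 sample_rate: int = 16000, min_seconds: float = 0.5):
        self.transcribe_fn = transcribe_fn
        self.min_samples = int(sample_rate * min_seconds)
        self.buffer: List[float] = []

    def process_chunk(self, chunk: Sequence[float]) -> Optional[str]:
        self.buffer.extend(chunk)
        if len(self.buffer) < self.min_samples:
            return None  # Not enough audio yet; keep buffering
        audio, self.buffer = self.buffer, []  # Reset before transcribing
        return self.transcribe_fn(audio)
```

Resetting the buffer before calling `transcribe_fn` means a failed transcription does not leave stale audio behind for the next chunk.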

### Pattern 2: VAD Preprocessing (Reduce Processing)

```python
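# GOOD - Filter silence before transcription
import numpy as np
import webrtcvad

vad = webrtcvad.Vad(2)  # Aggressiveness 0 (least) to 3 (most)

def extract_speech(audio, sr=16000):
    # webrtcvad expects 16-bit mono PCM; clip so out-of-range samples don't wrap
    audio_int16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    frame_size = int(sr * 30 / 1000)  # 30ms frames
    speech = [
        audio[i:i+frame_size] for i in range(0, len(audio_int16), frame_size)
        if len(audio_int16[i:i+frame_size]) == frame_size
        and vad.is_speech(audio_int16[i:i+frame_size].tobytes(), sr)
    ]
    # np.concatenate raises on an empty list, so handle all-silence input
    return np.concatenate(speech) if speech else np.zeros(0, dtype=np.float32)

# BAD - Process entire audio including silence
model.transcribe(audio_path)  # Wastes compute on silence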
```
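
`webrtcvad` only accepts 16-bit mono PCM at 8000, 16000, 32000, or 48000 Hz, in frames of 10, 20, or 30 ms. The sketch below checks those constraints and slices raw PCM bytes into complete frames before they reach the detector. `pcm_frames` is a hypothetical helper name and uses only the standard library.

```python
from typing import Iterator

VALID_RATES = {8000, 16000, 32000, 48000}
VALID_FRAME_MS = {10, 20, 30}


def pcm_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> Iterator[bytes]:
    """Yield complete 16-bit mono PCM frames; drop a trailing partial frame."""
    if sample_rate not in VALID_RATES:
        raise ValueError(f"Unsupported sample rate: {sample_rate}")
    if frame_ms not in VALID_FRAME_MS:
        raise ValueError(f"Unsupported frame length: {frame_ms} ms")
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]
```

Each yielded frame can be passed directly to `vad.is_speech(frame, sample_rate)`.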

### Pattern 3: Model Quantization (Memory + Speed)

```python
# GOOD - Quantized for CPU
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="int8")

# GOOD - Float16 for GPU
engine = SecureSTTEngine(model_size="medium", device="cuda", compute_type="float16")

# BAD - Full precision unnecessarily
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="float32")
```
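
The effect of `compute_type` on memory can be approximated as parameter count times bytes per weight. The sketch below is a rough, weights-only estimate: the parameter counts (in millions) are the published Whisper model sizes, and the estimate ignores activations, quantization scales, and runtime overhead. `estimate_weight_mb` is a hypothetical helper name.

```python
# Parameter counts in millions, from the published Whisper model sizes.
PARAMS_MILLIONS = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large-v3": 1550}
BYTES_PER_WEIGHT = {"int8": 1, "float16": 2, "float32": 4}


def estimate_weight_mb(model_size: str, compute_type: str) -> float:
    """Rough weights-only footprint in MB (10**6 bytes)."""
    return float(PARAMS_MILLIONS[model_size] * BYTES_PER_WEIGHT[compute_type])
```

Under these assumptions, `small` needs roughly 244 MB of weights at `int8` versus roughly 976 MB at `float32`, which is why full precision on CPU is listed as the pattern to avoid.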

### Pattern 4: Batch Processing (Throughput)

```python
# GOOD - Process multiple files in parallel
from concurrent.futures import ThreadPoolExecutor

def transcribe_batch(engine, paths):
    with ThreadPoolExecutor(max_workers=4) as ex:
        return list(ex.map(engine.transcribe, paths))

# BAD - Sequential processing
results = [engine.transcribe(p) for p in paths]  # Blocks on each
```
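
With `ex.map`, one bad file raises when its result is read and the remaining results are lost. A possible variant, sketched below, records a per-file error instead. `transcribe_batch_safe` is a hypothetical name, and `transcribe` is any callable that takes a path and returns text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, Tuple


def transcribe_batch_safe(
    transcribe: Callable[[str], str],
    paths: Sequence[str],
    max_workers: int = 4,
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (path, text, error) per input, in input order."""
    def run(path: str):
        try:
            return path, transcribe(path), None
        except Exception as exc:
            # Record only the exception type, never message content
            return path, None, type(exc).__name__

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, paths))
```

Only the exception type is kept, in line with the rule that logs and error reports must not carry transcription content.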

### Pattern 5: Audio Buffering (Memory Efficiency)

```python
# GOOD - Fixed-size ring buffer
import numpy as np

class RingBuffer:
    def __init__(self, max_samples):
        self.buffer = np.zeros(max_samples, dtype=np.float32)
        self.idx = 0  # Next write position

    def append(self, audio):
        size = len(self.buffer)
        # Keep only the newest samples if the chunk exceeds capacity
        audio = np.asarray(audio, dtype=np.float32)[-size:]
        n = len(audio)
        if n == 0:
            return
        first = min(n, size - self.idx)  # Samples that fit before wrapping
        self.buffer[self.idx:self.idx + first] = audio[:first]
        self.buffer[:n - first] = audio[first:]  # Wrapped remainder (may be empty)
        self.idx = (self.idx + n) % size

# BAD - Unbounded list growth
chunks = []
chunks.append(audio)  # Memory leak over time
```
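
Where NumPy slicing is not needed, the same bound can be had from the standard library. This sketch swaps in `collections.deque` with `maxlen`, which is a different technique from the array-backed ring buffer above; `BoundedAudioBuffer` is a hypothetical name.

```python
from collections import deque
from typing import Iterable, List


class BoundedAudioBuffer:
    """Keep only the newest max_samples samples, oldest first."""

    def __init__(self, max_samples: int):
        self._samples = deque(maxlen=max_samples)

    def append(self, audio: Iterable[float]) -> None:
        self._samples.extend(audio)  # Old samples fall off the left end

    def latest(self) -> List[float]:
        return list(self._samples)
```

The deque version is simpler to reason about but stores Python objects per sample, so the array-backed buffer remains the better fit for long windows.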

---

## 7. Implementation Patterns

### Pattern 1: Secure Faster Whisper Setup

```python
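import os
import tempfile
from pathlib import Path

import structlog
from faster_whisper import WhisperModel

logger = structlog.get_logger()


class ValidationError(ValueError):
    """Raised when an audio file fails existence, size, or format checks."""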

class SecureSTTEngine:
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        valid_sizes = ["tiny", "base", "small", "medium", "large-v3"]
        if model_size not in valid_sizes:
            raise ValueError(f"Invalid model size: {model_size}")

        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)
        self.temp_dir = tempfile.mkdtemp(prefix="jarvis_stt_")
        os.chmod(self.temp_dir, 0o700)

    def transcribe(self, audio_path: str) -> str:
        path = Path(audio_path).resolve()
        if not self._validate_audio_file(path):
            raise ValidationError("Invalid audio file")

        try:
            segments, info = self.model.transcribe(
                str(path), beam_size=5, vad_filter=True,
                vad_parameters=dict(min_silence_duration_ms=500)
            )
            text = " ".join(s.text for s in segments)
            logger.info("stt.transcribed", duration=info.duration)
            return text.strip()
        finally:
            path.unlink(missing_ok=True)

    def _validate_audio_file(self, path: Path) -> bool:
        if not path.exists():
            return False
        if path.stat().st_size > 50 * 1024 * 1024:
            return False
        return path.suffix.lower() in {'.wav', '.mp3', '.flac', '.ogg', '.m4a'}

    def cleanup(self):
        import shutil
        shutil.rmtree(self.temp_dir, ignore_errors=True)
```

### Pattern 2: Privacy-Preserving Transcription

```python
class PrivacyAwareSTT:
    """STT with privacy protections."""

    def __init__(self, engine: SecureSTTEngine):
        self.engine = engine

    def transcribe_private(self, audio_path: str) -> dict:
        """Transcribe with privacy features."""
        # Transcribe
        text = self.engine.transcribe(audio_path)

        # Remove PII patterns
        cleaned = self._remove_pii(text)

        # Log without content
        logger.info("stt.transcribed_private",
                   word_count=len(cleaned.split()),
                   had_pii=cleaned != text)

        return {
            "text": cleaned,
            "privacy_filtered": cleaned != text
        }

    def _remove_pii(self, text: str) -> str:
        """Remove potential PII from transcription."""
        import re

        # Phone numbers
        text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)

        # Email addresses
        text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)

        # Social security numbers
        text = re.sub(r'\b\d{3}[-]?\d{2}[-]?\d{4}\b', '[SSN]', text)

        # Credit card numbers
        text = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', text)

        return text
```
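
The four substitutions can be exercised in isolation to see what they catch and what they miss. The sketch below reuses the same regular expressions as a standalone function (`remove_pii` and `PII_PATTERNS` are hypothetical names) and compiles them once.

```python
import re

# Same patterns as _remove_pii, compiled once at import time.
PII_PATTERNS = [
    (re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'), '[PHONE]'),
    (re.compile(r'\b[\w.-]+@[\w.-]+\.\w+\b'), '[EMAIL]'),
    (re.compile(r'\b\d{3}[-]?\d{2}[-]?\d{4}\b'), '[SSN]'),
    (re.compile(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'), '[CARD]'),
]


def remove_pii(text: str) -> str:
    """Replace digit-based PII patterns with placeholders."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    print(remove_pii("Call 555-123-4567 or mail jane.doe@example.com"))
```

These patterns only match digits written as digits. A transcription that spells numbers out as words ("five five five ...") passes through unchanged, so the filter should be treated as a baseline rather than a guarantee.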

---

## 8. Security Standards

**Privacy Concerns**: Audio can contain sensitive conversations; voice biometrics are PII; transcriptions may leak personal data.

**Required Mitigations**:
```python
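from pathlib import Path

# `engine` is a SecureSTTEngine instance; `ValidationError` is the exception
# its validation step raises (see Section 7, Pattern 1).

# Always delete after processing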
def transcribe_and_delete(audio_path: str) -> str:
    try:
        return engine.transcribe(audio_path)
    finally:
        Path(audio_path).unlink(missing_ok=True)

# Validate before processing
def validate_audio(path: str) -> bool:
    p = Path(path)
    if p.stat().st_size > 50 * 1024 * 1024:
        raise ValidationError("File too large")
    if p.suffix.lower() not in {'.wav', '.mp3', '.flac', '.ogg', '.m4a'}:
        raise ValidationError("Invalid format")
    return True
```
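
Size and extension checks do not bound how long a recording is. For PCM WAV input, duration can be read from the header with the standard `wave` module, as sketched below. The 30-second limit and both function names are assumptions, and other formats (MP3, FLAC, float WAV) would need a different reader.

```python
import wave

MAX_DURATION_SECONDS = 30.0  # Assumed limit for voice commands; tune as needed


def wav_duration_seconds(path: str) -> float:
    """Read duration from the WAV header without decoding the audio."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def validate_wav_duration(path: str, max_seconds: float = MAX_DURATION_SECONDS) -> float:
    """Return the duration, or raise ValueError if it is empty or too long."""
    duration = wav_duration_seconds(path)
    if duration <= 0 or duration > max_seconds:
        raise ValueError(f"Unsupported duration: {duration:.2f}s")
    return duration
```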

---

## 9. Common Mistakes

### NEVER: Keep Audio Files

```python
# BAD - Audio persists
def transcribe(path):
    return model.transcribe(path)  # File remains

# GOOD - Delete after use
def transcribe(path):
    try:
        return model.transcribe(path)
    finally:
        Path(path).unlink(missing_ok=True)  # Don't raise if already removed
```

### NEVER: Log Transcription Content

```python
# BAD - Logs sensitive content
logger.info(f"Transcribed: {text}")

# GOOD - Log metadata only
logger.info("stt.complete", word_count=len(text.split()))
```
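
One way to keep content out of logs is to build the log fields in a single place that never receives more than counts and timings. `transcript_metadata` below is a hypothetical helper, and the field names are assumptions.

```python
def transcript_metadata(text: str, duration_seconds: float) -> dict:
    """Build log fields that describe a transcription without quoting it."""
    return {
        "word_count": len(text.split()),
        "char_count": len(text),
        "duration_seconds": round(duration_seconds, 2),
    }
```

The result can be passed as keyword arguments to the logger, for example `logger.info("stt.complete", **transcript_metadata(text, info.duration))`.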

---

## 10. Pre-Implementation Checklist

### Phase 1: Before Writing Code

- [ ] Read SKILL.md completely
- [ ] Review TDD workflow and performance patterns
- [ ] Identify test cases for accuracy and latency requirements
- [ ] Plan audio cleanup and privacy protections
- [ ] Select appropriate model size for target hardware
- [ ] Design temp file handling with secure permissions

### Phase 2: During Implementation

- [ ] Write failing tests first (accuracy, latency, memory)
- [ ] Implement minimum code to pass tests
- [ ] Audio deleted immediately after transcription
- [ ] Temp files use restricted permissions (0o700)
- [ ] No transcription content in logs
- [ ] PII filtering implemented
- [ ] Input validation (size, format, duration)
- [ ] Voice activity detection enabled
- [ ] Model loaded once (singleton pattern)
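
The "model loaded once" item can be met with a small lazy holder, sketched here with double-checked locking so concurrent callers do not load the model twice. `LazySingleton` is a hypothetical name, and the factory is assumed to return a non-`None` object.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazySingleton(Generic[T]):
    """Create an expensive object on first use and reuse it afterwards."""

    def __init__(self, factory: Callable[[], T]):
        self._factory = factory
        self._lock = threading.Lock()
        self._instance: Optional[T] = None

    def get(self) -> T:
        if self._instance is None:
            with self._lock:
                # Re-check inside the lock in case another thread won the race
                if self._instance is None:
                    self._instance = self._factory()
        return self._instance
```

A module could define `stt_engine = LazySingleton(lambda: SecureSTTEngine(model_size="base"))` and call `stt_engine.get()` wherever the engine is needed.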

### Phase 3: Before Committing

- [ ] All tests pass: `pytest tests/test_stt_engine.py -v`
- [ ] Coverage above 80%: `pytest --cov=jarvis.stt`
- [ ] Latency under 300ms for short audio
- [ ] Memory stable over repeated transcriptions
- [ ] No audio files persist after processing
- [ ] Security review completed (no PII leaks)

---

## 11. Summary

Your goal is to create STT systems that are:
- **Private**: Audio processed locally, deleted immediately
- **Fast**: Optimized for real-time voice assistant responses
- **Accurate**: Appropriate model and preprocessing for context

You understand that voice data requires special privacy protection. Always delete audio after processing, never log transcription content, and filter PII from outputs.

**Critical Reminders**:
1. Delete audio files immediately after transcription
2. Never log transcription content
3. Filter PII from transcription results
4. Use secure temp directories with restricted permissions
5. Validate all audio input (size, format, duration)