# STT-CLI v2.0 Architecture Documentation
**Version:** 2.0.0
**Last Updated:** November 26, 2025
**Author:** Mantej Singh Dhanjal
---
## Table of Contents
1. [System Overview](#system-overview)
2. [Architecture Layers](#architecture-layers)
3. [Component Diagram](#component-diagram)
4. [Data Flow](#data-flow)
5. [Technology Stack](#technology-stack)
6. [Design Decisions](#design-decisions)
7. [Security Considerations](#security-considerations)
---
## System Overview
STT-CLI v2.0 is a **Windows-only**, **background-running**, **hybrid speech-to-text** application with support for:
- **Offline mode**: OpenAI Whisper (MIT licensed)
- **Online mode**: Google Web Speech API
- **Auto-detect mode**: Intelligent switching based on internet connectivity
### Key Characteristics
- **Single-file executable**: No installation required (~150MB with Whisper dependencies)
- **System tray application**: No visible window, GUI-less operation
- **Event-driven architecture**: Global hotkey triggers recording
- **Multi-threaded**: Main thread, tray thread, recording thread
- **Security-focused**: Only types into CLI windows (cmd.exe, powershell.exe, WindowsTerminal.exe)
---
## Architecture Layers
### Layer 1: User Interface (System Tray)
**Component:** `pystray.Icon`
**Thread:** Daemon thread (tray_thread)
**Responsibilities:**
- Display system tray icon (idle vs listening states)
- Provide context menu with:
  - 🎙️ Engine submenu (Auto/Whisper/Google)
  - Start on Windows Boot (checkable)
  - About (version info)
  - Quit
- Show balloon notifications for:
  - Recording start/stop
  - Engine changes
  - First-run welcome
  - About information
**Icons:**
- `stt-cli2.ico` - Idle state (blue microphone)
- `stt-cli2.png` - Listening state (red/active microphone)
---
### Layer 2: Input Layer (Global Hotkey)
**Component:** `pynput.keyboard.Listener`
**Thread:** Background keyboard listener
**Responsibilities:**
- Listen for **Left Alt** key presses globally
- Detect **double-press** within 0.3s window
- Implement **cooldown** (0.8s) to prevent multi-toggles
- Trigger `toggle_recording()` on valid double-press
- Support ESC key for emergency quit
**State Management:**
- `last_press_time` - Timestamp for double-press detection
- `last_toggle_time` - Timestamp for cooldown enforcement
- `recording_event` - Threading.Event() for recording state
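The double-press and cooldown checks above can be sketched as a pure function of these timestamps. This is a minimal illustration; `should_toggle` and its constants are named for clarity here and are not taken from the app's source:

```python
DOUBLE_PRESS_WINDOW = 0.3  # max seconds between presses to count as a double-press
COOLDOWN = 0.8             # min seconds between successive toggles

def should_toggle(now: float, last_press_time: float, last_toggle_time: float) -> bool:
    """Return True if a key press at `now` completes a valid double-press."""
    is_double = (now - last_press_time) <= DOUBLE_PRESS_WINDOW
    off_cooldown = (now - last_toggle_time) > COOLDOWN
    return is_double and off_cooldown
```

On a valid double-press the listener would then call `toggle_recording()` and record `now` as the new `last_toggle_time`.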
---
### Layer 3: Audio Capture
**Component:** `speech_recognition.Microphone`
**Thread:** Recording thread (spawned on-demand)
**Responsibilities:**
- Adjust for ambient noise (0.2s calibration)
- Continuous listening with 1-second timeout
- Convert audio to WAV format (16kHz, 16-bit)
- Pass audio data to transcription layer
**Audio Flow:**
```
Microphone → SpeechRecognition → AudioData → Transcription Engine
```
---
### Layer 4: Transcription Layer (Hybrid Engine)
**Components:**
- **Whisper Engine**: `faster-whisper` with CTranslate2 backend
- **Google Engine**: `speech_recognition.recognize_google()`
#### 4A: Whisper Engine (Offline)
**Library:** `faster-whisper` v1.2.1
**Model:** Systran/faster-whisper-tiny (39M parameters, ~75MB)
**Backend:** CTranslate2 with INT8 quantization
**Configuration:**
```python
model = WhisperModel(
    "tiny",
    device="cpu",
    compute_type="int8",              # 2-3x speedup on CPU
    download_root="%APPDATA%/stt-cli/models",
    num_workers=1
)

segments, info = model.transcribe(
    audio_array,
    language="en",           # Skip auto-detection (~200ms saved)
    beam_size=1,             # Greedy decoding (no beam search)
    best_of=1,               # No sampling
    temperature=0.0,         # Deterministic output
    vad_filter=False,        # Disable VAD (we handle silence)
    word_timestamps=False    # Skip timestamp calculation
)
```
**Performance:**
- **Latency:** 1-2 seconds for typical speech (2-5 seconds audio)
- **Accuracy:** Fair (tiny model trades accuracy for speed)
- **First-run:** 5-10s initial load + model download
- **Subsequent:** Instant (model cached in memory)
#### 4B: Google Engine (Online)
**Library:** `speech_recognition` v3.14.3
**API:** Google Web Speech API (free tier)
**Configuration:**
```python
text = recognizer.recognize_google(audio)
```
**Performance:**
- **Latency:** 0.5 seconds (faster than Whisper)
- **Accuracy:** Excellent (cloud-powered)
- **Requires:** Active internet connection
#### 4C: Engine Selection Logic
**Auto-Detect Mode** (default; the only mode before v2.0):
```python
def get_current_engine() -> STTEngineType:
    if current_engine == "auto":
        if has_internet():
            return "google"  # Fast + accurate
        else:
            return "whisper" if WHISPER_AVAILABLE else "google"
    return current_engine
```
**Manual Mode** (v2.0+):
- User selects engine via system tray menu
- Choice saved to `%APPDATA%\stt-cli\settings.json`
- Persists across app restarts
---
### Layer 5: Output Layer (CLI Window Detection)
**Component:** Win32 API + `pynput.keyboard.Controller`
**Responsibilities:**
- Detect foreground window using `win32gui.GetForegroundWindow()`
- Get process name via `win32process` + `psutil`
- Validate against CLI whitelist:
- `cmd.exe`
- `powershell.exe`
- `WindowsTerminal.exe`
- `wt.exe` (Windows Terminal alias)
- Type transcription + space character only if CLI is active
**Security Rationale:**
Prevents accidental typing of sensitive transcriptions into browsers, text editors, or other applications.
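In the app the foreground process name comes from `win32gui`/`win32process` plus `psutil`; the whitelist validation itself reduces to a pure string check, sketched below (`is_cli_process` is an illustrative helper name, not from the source):

```python
# Hardcoded whitelist, lowercased for case-insensitive comparison
CLI_WHITELIST = {"cmd.exe", "powershell.exe", "windowsterminal.exe", "wt.exe"}

def is_cli_process(process_name: str) -> bool:
    """Return True if the given foreground process name is an allowed CLI."""
    return process_name.lower() in CLI_WHITELIST
```

Only when this check passes does the output layer type the transcription into the active window.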
---
### Layer 6: Configuration & Persistence
**Component:** JSON-based settings in `%APPDATA%\stt-cli\`
**Settings File:** `settings.json`
```json
{
  "auto_start": true,
  "first_run": false,
  "stt_engine": "whisper",
  "whisper_model": "tiny",
  "version": "2.0.0"
}
```
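Loading such a file defensively (defaults for missing keys, tolerance for a corrupt file) could look like the following sketch; `load_settings` and `DEFAULTS` are illustrative names, not the app's actual code:

```python
import json

# Default values mirroring the settings.json schema shown above
DEFAULTS = {
    "auto_start": True,
    "first_run": False,
    "stt_engine": "whisper",
    "whisper_model": "tiny",
    "version": "2.0.0",
}

def load_settings(path: str) -> dict:
    """Read settings.json, keeping only known keys and falling back to defaults."""
    settings = dict(DEFAULTS)
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        if isinstance(data, dict):
            settings.update({k: v for k, v in data.items() if k in DEFAULTS})
    except (OSError, json.JSONDecodeError):
        pass  # missing or corrupt file -> defaults win
    return settings
```

Restricting the merge to known keys doubles as a lightweight schema validation on load.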
**Auto-Start Management:**
- Creates shortcut in `%APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup\`
- Uses `win32com.client` COM interface for shortcut creation
- Persists across Windows reboots
---
## Component Diagram
The diagram below shows component interactions and data flow.
See [STT-CLI-Architecture.drawio](./STT-CLI-Architecture.drawio) for the editable draw.io source file.
**Key Components:**
```
┌─────────────────────────────────────────────────────┐
│                  USER INTERACTION                   │
│    (Double-tap Left Alt, Right-click tray menu)     │
└──────────────────────┬──────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────┐
│                SYSTEM TRAY (pystray)                │
│  - Icon updates (idle/listening)                    │
│  - Menu: Engine, Auto-start, About, Quit            │
│  - Notifications (balloon tooltips)                 │
└──────────────────────┬──────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────┐
│           GLOBAL HOTKEY LISTENER (pynput)           │
│  - Detects double-press Left Alt                    │
│  - Cooldown enforcement (0.8s)                      │
│  - Calls toggle_recording()                         │
└──────────────────────┬──────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────┐
│              RECORDING THREAD (daemon)              │
│  - Adjust for ambient noise                         │
│  - Continuous listening (1s timeout)                │
│  - Convert audio → AudioData                        │
└──────────────────────┬──────────────────────────────┘
                       ▼
           ┌───────────┴────────────┐
           ▼                        ▼
  ┌──────────────────┐     ┌──────────────────┐
  │  WHISPER ENGINE  │     │  GOOGLE ENGINE   │
  │  (Offline)       │     │  (Online)        │
  │  - faster-whisper│     │  - Web Speech    │
  │  - CTranslate2   │     │  - API call      │
  │  - INT8 quant    │     │  - 0.5s latency  │
  │  - 1-2s latency  │     └────────┬─────────┘
  └────────┬─────────┘              │
           └───────────┬────────────┘
                       ▼
┌─────────────────────────────────────────────────────┐
│            CLI WINDOW DETECTION (Win32)             │
│  - Get foreground window                            │
│  - Check process name                               │
│  - Whitelist: cmd, powershell, WindowsTerminal      │
└──────────────────────┬──────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────┐
│         OUTPUT (pynput keyboard controller)         │
│  - Type transcription + space                       │
│  - Only if CLI window is active                     │
└─────────────────────────────────────────────────────┘
```
---
## Data Flow
### Recording Start Flow
```
1. User double-taps Left Alt
   ↓
2. on_press() detects double-press (within 0.3s)
   ↓
3. Check cooldown (must be > 0.8s since last toggle)
   ↓
4. toggle_recording() called
   ↓
5. recording_event.set() → State = RECORDING
   ↓
6. Icon changes to "listening" (stt-cli2.png)
   ↓
7. Notification shown: "Recording Started - Using: Whisper (Offline)"
   ↓
8. recording_thread spawned (daemon=True)
   ↓
9. recording_loop() starts continuous listening
```
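Steps 4-5 and 8 of the flow above hinge on a `threading.Event` flag. A minimal sketch of that toggle (the `start_worker` parameter stands in for spawning the daemon recording thread and is illustrative, not from the source):

```python
import threading

recording_event = threading.Event()  # set = recording, clear = idle

def toggle_recording(start_worker=lambda: None) -> str:
    """Flip recording state; launch the worker only on the off->on transition."""
    if recording_event.is_set():
        recording_event.clear()   # recording_loop sees the cleared flag and exits
        return "stopped"
    recording_event.set()
    start_worker()                # in the app: spawn the daemon recording thread
    return "started"
```

Because the loop polls `recording_event`, stopping requires no thread kill: the worker simply drains its current 1-second listen and returns.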
### Transcription Flow (Whisper)
```
1. recording_loop() captures audio
   ↓
2. recognizer.listen(source, timeout=1) → AudioData
   ↓
3. get_current_engine() → "whisper"
   ↓
4. transcribe_with_whisper(audio)
   ↓
5. get_whisper_model() - lazy load with thread lock
   ↓
6. Audio conversion: WAV → numpy array (16kHz, float32)
   ↓
7. model.transcribe(audio_array, language="en", beam_size=1...)
   ↓
8. Segments joined → text string
   ↓
9. Log: "[WHISPER] Transcribed: {text}"
   ↓
10. is_cli_window() → Check foreground window
    ↓
11. If CLI: Type text + space using keyboard_controller
    If NOT CLI: Log "Active window is not a CLI, skipping"
```
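Step 6 above rescales 16-bit PCM samples into the float range Whisper expects. The app does this with numpy; a stdlib-only equivalent (illustrative helper name) shows the arithmetic:

```python
import array
import sys

def pcm16_to_float(raw: bytes) -> list:
    """Convert 16-bit little-endian PCM bytes to floats in [-1.0, 1.0)."""
    samples = array.array("h")      # signed 16-bit, native byte order
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()          # WAV PCM data is little-endian
    return [s / 32768.0 for s in samples]
```

Dividing by 32768 maps the int16 range [-32768, 32767] onto [-1.0, ~1.0), matching the float32 normalization the transcription step assumes.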
### Engine Switching Flow
```
1. User right-clicks tray icon
   ↓
2. Hovers over "🎙️ Engine"
   ↓
3. Selects "Whisper (Offline)"
   ↓
4. set_engine("whisper") called
   ↓
5. current_engine = "whisper" (global state update)
   ↓
6. save_settings() → Persist to settings.json
   ↓
7. Notification: "Engine Changed - Now using: Whisper (Offline)"
   ↓
8. icon.update_menu() → Checkmark moves to "Whisper"
```
---
## Technology Stack
### Core Dependencies
| Library | Version | Purpose |
|---------|---------|---------|
| `faster-whisper` | 1.2.1 | Offline speech recognition (4x faster than vanilla Whisper) |
| `av` (PyAV) | 16.0.1 | Audio decoding for Whisper (FFmpeg bindings) |
| `ctranslate2` | 4.6.1 | Fast inference engine for Whisper (INT8 quantization) |
| `numpy` | Latest | Audio array processing |
| `SpeechRecognition` | 3.14.3 | Google Web Speech API integration + microphone access |
| `pynput` | 1.8.1 | Global keyboard listener + controller |
| `PyAudio` | 0.2.14 | Microphone capture (SpeechRecognition dependency) |
| `pystray` | 0.19.5 | System tray icon + menu |
| `Pillow` | 10.4.0 | Image handling for tray icons |
| `pywin32` | 311 | Windows API access (COM, Win32GUI) |
| `psutil` | 7.1.1 | Cross-platform process utilities |
### Build Tools
| Tool | Purpose |
|------|---------|
| `PyInstaller` | Single-file .exe packaging with hidden imports |
| `certutil` | SHA256 hash calculation for winget manifests |
| `draw.io` | Architecture diagram creation |
---
## Design Decisions
### 1. Why Hybrid Architecture?
**Problem:** Corporate laptops often have:
- Win+H disabled by IT policies
- Restricted/intermittent internet access
- Firewall blocks to cloud speech APIs
**Solution:** Hybrid approach provides:
- **Offline fallback**: Whisper works without internet
- **Accuracy when available**: Google is faster + more accurate online
- **User control**: Manual engine selection for specific scenarios
### 2. Why "tiny" Whisper Model?
**Rationale:**
- **Speed > Accuracy** for CLI use case (quick commands, short dictation)
- **Small download** (75MB vs 460MB for "small" model)
- **Low RAM** (~1.5GB vs 2GB+ for larger models)
- **Corporate laptop friendly** (many users have limited resources)
**Trade-off:** Fair accuracy vs excellent speed. Users can upgrade to "base" by editing settings.json.
### 3. Why Lazy Model Loading?
**Rationale:**
- **Fast startup** (app starts in <2s, no Whisper delay)
- **Memory efficient** (model only loaded when needed)
- **Google-first users** (those who never use Whisper don't pay the cost)
**Implementation:** Thread lock ensures single model instance, loaded on first Whisper transcription.
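The pattern described above is double-checked locking around a lazily created singleton. A sketch follows, with the loader injected as a callable standing in for the real `WhisperModel(...)` construction (names are illustrative, not the app's source):

```python
import threading

_model = None
_model_lock = threading.Lock()

def get_model(loader):
    """Create the model on first use; the lock ensures exactly one instance
    even if several transcription threads race here simultaneously."""
    global _model
    if _model is None:                # fast path: no lock once loaded
        with _model_lock:
            if _model is None:        # re-check inside the lock
                _model = loader()
    return _model
```

After the first call, the fast path returns the cached instance without touching the lock, so steady-state transcriptions pay no synchronization cost.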
### 4. Why CLI-Only Output?
**Security Rationale:**
- Prevents typing sensitive transcriptions into:
- Web browsers (could expose passwords, personal info)
- Text editors (unintended file modifications)
- Chat applications (accidental messages)
- CLI windows are explicit targets (Windows Terminal, PowerShell, CMD)
**User Benefit:** Confidence that speech won't leak into wrong applications.
### 5. Why Global State Instead of Classes?
**Rationale:**
- **Simplicity**: Single-file application with ~800 lines
- **Threading compatibility**: Global state with locks is clearer than class state
- **Maintainability**: Easier to understand for small projects
**Trade-off:** Not scalable for large projects, but appropriate for STT-CLI's scope.
---
## Security Considerations
### 1. CLI Window Whitelist
**Threat:** Accidental typing into sensitive applications
**Mitigation:** Hardcoded whitelist of CLI process names
**Code Location:** `main.pyw:is_cli_window()` (line ~460)
### 2. Audio Privacy (Whisper Mode)
**Threat:** Audio transmitted to cloud without user knowledge
**Mitigation:**
- Whisper runs 100% locally (no network calls)
- User can verify via network monitoring
- Clear UI indication of engine in use
### 3. Model Download Security
**Threat:** Man-in-the-middle attack during model download
**Mitigation:**
- Downloads from Hugging Face CDN (HTTPS)
- Model integrity checked via hash (Hugging Face built-in)
- One-time download, cached locally
### 4. Settings File Access
**Threat:** Unauthorized modification of settings.json
**Mitigation:**
- Stored in `%APPDATA%` (user-only access)
- No sensitive data stored (only UI preferences)
- JSON schema validation on load
---
## Related Documentation
- [WHISPER_MODEL.md](./WHISPER_MODEL.md) - Model download, caching, lifecycle
- [THREADING.md](./THREADING.md) - Threading model and synchronization
- [STT-CLI-Architecture.drawio](./STT-CLI-Architecture.drawio) - Visual flow diagrams
---
**End of Architecture Documentation**