# AGENTS.md — Speech Translation Project Guidelines for AI coding agents working on this Vietnamese-English Speech Translation repository. ## Project Overview This repository contains training and inference code for Speech-to-Text Translation (ST) models using SeamlessM4T for the Vietnamese ↔ English language pair. ## Build / Test / Lint Commands ### Code Quality (via seamless_communication submodule) ```bash cd src/seamless_communication # Format code with Black black src/ tests/ # Type checking with mypy mypy src/ # Pre-commit hooks pre-commit run --all-files ``` ### Installation ```bash # Install SeamlessM4T dependencies ## Code Style Guidelines ### Imports - **Standard library** first, **third-party** second, **local** third - Group imports with a blank line between groups - Use absolute imports over relative imports - Example: ```python import json import os from pathlib import Path from typing import List, Optional import torch import torchaudio from tqdm import tqdm from src.metrics import compute_wer ``` ### Formatting - **Black** code formatter (line length: 88 characters) - **isort** with "black" profile for import sorting - Use double quotes for strings consistently - Trailing commas in multi-line structures ### Type Hints - Use Python 3.8+ typing syntax - Annotate function parameters and return types - Use `Optional[Type]` for nullable values - Use `List[Type]`, `Dict[Key, Value]` from typing module - Example: ```python def compute_cer(reference: str, hypothesis: str, normalize: bool = True) -> float: ... ``` ### Naming Conventions - **snake_case** for functions, variables, methods - **PascalCase** for classes - **SCREAMING_SNAKE_CASE** for constants - Private methods/functions prefix with underscore: `_helper()` - Example: ```python MAX_SAMPLES = 50000 TARGET_SAMPLE_RATE = 16_000 def load_audio(filepath: str) -> tuple[torch.Tensor, int]: ... class MetricsEvaluator: def _normalize_text(self, text: str) -> str: ... ``` ### Docstrings - Use Google-style docstrings - Document all public functions and classes - Include Args, Returns, and Raises sections - Example: ```python def evaluate(self, references: List[str], hypotheses: List[str]) -> MetricsResult: """ Evaluate all metrics on the full corpus. Args: references: List of ground-truth strings. hypotheses: List of model output strings (same length). Returns: MetricsResult with aggregated CER, WER and BLEU. Raises: ValueError: If references and hypotheses differ in length. """ ``` ### Error Handling - Use specific exceptions (ValueError, RuntimeError, etc.) - Raise exceptions with descriptive messages - Handle expected errors gracefully in inference code - Use try/except blocks sparingly, only for expected failures - Example: ```python if len(ref_chars) == 0: if len(hyp_chars) == 0: return 0.0 raise ValueError("Reference is empty but hypothesis is not.") ``` ### Comments - Write docstrings for modules, classes, and public functions - Use inline comments sparingly, only for non-obvious logic - Prefix comments with `# ` (space after hash) - Use section dividers for long files: ```python # --------------------------------------------------------------------------- # Helpers # --------------------------------------------------------------------------- ``` ## Project Structure ``` SpeechTranslation/ ├── src/ │ ├── metrics.py # CER/WER/BLEU evaluation metrics │ ├── llm.py # Gemini LLM wrapper │ ├── early_stopping.py # Training utilities │ └── seamless_communication/ # SeamlessM4T model code ├── scripts/ │ ├── prepare_data.py # JSONL → TSV conversion │ ├── train_spm.py # SentencePiece training │ └── compute_gcmvn.py # GCMVN statistics ├── inference/ │ ├── seamless_infer.py # SeamlessM4T inference │ ├── single_infer.py # Single audio inference │ └── batch_infer.py # Multi-GPU batch inference ├── configs/ # Training configs ├── datasets/ # JSONL datasets (metadata) └── data/ # TSV manifests ``` ## Key Technologies - **PyTorch / fairseq2** — Deep learning framework - **torchaudio** — Audio processing - **sentencepiece** — Text tokenization - **sacrebleu** — BLEU score computation - **hydra** — Configuration management - **unittest** — Testing framework ## Language Considerations This project handles both **English** and **Vietnamese** text: - Use Unicode NFC normalization for text comparison - Vietnamese requires special handling for diacritics - Use sacrebleu's "char" tokenizer for Vietnamese BLEU scores