# ALME: Audio-LLM Modality Evaluation Benchmark

ALME evaluates how audio-LLMs handle conflicting audio and text inputs across **8 typologically diverse languages**: English, German, French, Italian, Portuguese, Arabic, Japanese, and Chinese.

The core metric is the **Text Dominance Ratio (TDR)**: the proportion of conflict trials in which the model follows the (incorrect) text rather than the (correct) audio signal.
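As a rough illustration of the definition, TDR can be computed from per-trial records as shown in the sketch below. The `Trial` record layout and the `text_dominance_ratio` helper are hypothetical and are not ALME's actual schema or API. The sketch also assumes that trials where the model picks neither the text-supported nor the audio-supported answer stay in the denominator.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One conflict trial (hypothetical layout, not ALME's schema)."""
    audio_answer: str  # choice supported by the audio (treated as correct)
    text_answer: str   # choice supported by the conflicting text
    model_answer: str  # choice the model actually selected


def text_dominance_ratio(trials: list[Trial]) -> float:
    """Fraction of conflict trials where the model sided with the text."""
    if not trials:
        raise ValueError("TDR is undefined for an empty trial list")
    followed_text = sum(t.model_answer == t.text_answer for t in trials)
    return followed_text / len(trials)


trials = [
    Trial(audio_answer="A", text_answer="B", model_answer="B"),  # followed text
    Trial(audio_answer="A", text_answer="B", model_answer="A"),  # followed audio
    Trial(audio_answer="C", text_answer="D", model_answer="D"),  # followed text
    Trial(audio_answer="C", text_answer="D", model_answer="B"),  # followed neither
]
tdr = text_dominance_ratio(trials)
print(f"TDR = {tdr:.1%}")  # 2 of 4 trials followed the text -> TDR = 50.0%
```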

## Quick Start

### Prerequisites

- Linux with an NVIDIA GPU (12 GB+ VRAM with 8-bit quantization, 24 GB+ without)
- Python 3.10+
- [uv](https://docs.astral.sh/uv/) package manager
- Common Voice Corpus 22.0 for all 8 languages (see [INSTRUCTIONS.md](INSTRUCTIONS.md))

### Install

```bash
git clone https://github.com/jb1999/alme-benchmark.git
cd alme-benchmark
uv sync
```

### Verify audio data

```bash
uv run python scripts/verify_audio.py --cv-root /path/to/cv-corpus-22.0-2025-06-20
```

### Run evaluation

```bash
# Quick test (100 stimuli)
uv run alme-eval --cv-root /path/to/cv-corpus-22.0-2025-06-20 --max-stimuli 100

# Full evaluation (57,602 stimuli, ~24 hours on a single GPU)
uv run alme-eval --cv-root /path/to/cv-corpus-22.0-2025-06-20 --output results/ultravox.json
```

### Check results

```bash
uv run python scripts/regression_test.py --results results/ultravox_metrics.json
```

## Reference Results (Ultravox v0.6)

### Overall

| Metric | Value |
|--------|-------|
| TDR (all stimuli) | 48.8% |
| Total stimuli | 57,602 |

### TDR by Language

| Language | TDR | n |
|----------|-----|---|
| EN | 39.2% | 7,325 |
| DE | 48.5% | 7,199 |
| FR | 47.7% | 7,405 |
| IT | 40.1% | 7,292 |
| PT | 49.4% | 7,235 |
| AR | 53.8% | 6,883 |
| JA | 55.1% | 7,013 |
| ZH | 57.6% | 7,116 |
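As a quick sanity check, the overall TDR should be close to the n-weighted mean of the per-language values. The sketch below uses the rounded percentages from the table above, so the result is approximate. The variable names are illustrative only.

```python
# (TDR in percent, n) per language, copied from the table above
tdr_by_language = {
    "EN": (39.2, 7325),
    "DE": (48.5, 7199),
    "FR": (47.7, 7405),
    "IT": (40.1, 7292),
    "PT": (49.4, 7235),
    "AR": (53.8, 6883),
    "JA": (55.1, 7013),
    "ZH": (57.6, 7116),
}

# Weight each language's TDR by its number of stimuli
total_n = sum(n for _, n in tdr_by_language.values())
weighted_tdr = sum(tdr * n for tdr, n in tdr_by_language.values()) / total_n

print(f"n-weighted TDR = {weighted_tdr:.1f}%")  # n-weighted TDR = 48.8%
```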

### TDR by Flip Type

| Flip Type | TDR | n |
|-----------|-----|---|
| adjective_swap | 49.1% | 14,187 |
| negation_add | 59.5% | 6,426 |
| negation_remove | 43.0% | 8,236 |
| number_swap | 47.0% | 14,082 |
| time_swap | 49.0% | 14,537 |

## TTS Resynthesis Data

Pre-synthesized Azure Neural TTS audio for all 57,602 stimuli is available as a [GitHub Release](https://github.com/jb1999/alme-benchmark/releases/tag/tts-audio-v1). It supports the TTS resynthesis experiment, which replaces the natural Common Voice audio with synthetic speech.

Download and extract:

```bash
# Download all languages (~7.4 GB total)
for lang in ar de en fr it ja pt zh; do
  gh release download tts-audio-v1 -p "tts_audio_${lang}.tar.gz"
done

# Extract into data/tts_audio/
mkdir -p data/tts_audio
for f in tts_audio_*.tar.gz; do
  tar xzf "$f" -C data/tts_audio/
done
```

See [INSTRUCTIONS.md](INSTRUCTIONS.md) for detailed TTS evaluation instructions.

## Extending with New Models

To add a new model, create a class that inherits from `alme.models.base.ModelAdapter`:

```python
from alme.models.base import ModelAdapter, ModelResponse

class MyModelAdapter(ModelAdapter):
    @property
    def model_id(self) -> str:
        return "my-model"

    def supports_mode(self, mode: str) -> bool:
        return mode in ("audio_only", "text_only",
                        "audio_text_aligned", "audio_text_conflict")

    def run(self, audio_path, text, question, choices,
            system_prompt, condition="") -> ModelResponse:
        # Your inference code here
        ...

    def load(self) -> None:
        # Load model weights
        ...

    def unload(self) -> None:
        # Free GPU memory
        ...
```

Then register it in `alme/models/__init__.py`:

```python
_REGISTRY = {
    "ultravox": ("ultravox", "UltravoxAdapter", {}),
    "my-model": ("my_model", "MyModelAdapter", {}),
}
```

## Citation

If you use ALME in your research, please cite:

```bibtex
@article{billa2026alme,
  title={When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration},
  author={Billa, Jayadev},
  journal={arXiv preprint arXiv:2602.11488},
  year={2026}
}
```

Paper: [arXiv:2602.11488](https://arxiv.org/abs/2602.11488)

## License

Apache 2.0. See [LICENSE](LICENSE).

For detailed setup instructions, troubleshooting, and advanced usage, see [INSTRUCTIONS.md](INSTRUCTIONS.md).