
[](https://github.com/jingyaogong/minimind-o/stargazers)
[](LICENSE)
[](https://github.com/jingyaogong/minimind-o/commits/master)
[](https://github.com/jingyaogong/minimind-o/pulls)
[](https://huggingface.co/collections/jingyaogong/minimind-o)
[](http://arxiv.org/abs/2605.03937)
[📄 MiniMind-O Technical Report](http://arxiv.org/abs/2605.03937)
https://github.com/user-attachments/assets/10cbcc5f-4e70-45cf-bdc5-d6361e40bb86
[🔗 Online Demo (Gradio)](https://modelscope.cn/studios/gongjy/MiniMind-O) | [🔗 Video Intro](https://www.bilibili.com/video/BV1V1RsBcEMX)
---
# 📌 Project Introduction
After [MiniMind](https://github.com/jingyaogong/minimind) (LLM) and [MiniMind-V](https://github.com/jingyaogong/minimind-v) (VLM), MiniMind-O is the third stop in this series. By "Omni" we mean a model that can listen, see and speak at the same time: it takes text, speech and visual signals as inputs, and produces text together with streaming speech.
GPT-4o was probably the first system that made natural streaming voice interaction feel real. Since then, open-source projects such as Mini-Omni2, Moshi, GLM-4-Voice and Qwen3-Omni have gradually appeared. However, if the goal is not just to call ready-made checkpoints with billions of parameters, but to fully understand, train and modify a complete Omni model from scratch, the open-source community still lacks a sufficiently lightweight starting point with an end-to-end pipeline. A common way to bring speech into an Omni model is to chain ASR, LLM and TTS into a cascade: speech is first transcribed to text, the LLM processes it, and the answer is then synthesized back to speech. This is straightforward from an engineering perspective, but it adds an extra transcription step and noticeably hurts latency, prosody and emotional cues.
MiniMind-O attempts to fill this gap: speech and text are connected directly at the hidden-state level, while the trainable backbone remains only ~0.1B parameters and the end-to-end Omni pipeline is preserved. The Talker side adopts MTP (Multi-Token Prediction) to predict multiple Mimi codebook layers at once, and combines it with VAD to support real-time barge-in and near-duplex interaction—a practical engineering route for a tiny Omni model. The code, model weights, training data and technical report are all open-sourced. A single RTX 3090 can finish training on the mini dataset in about 2 hours. The goal remains the same: let everyone read the project from the first line of code, and train, from scratch, a model that can listen, see, think and speak:

😊 Enjoy building.
---
#### 🎉 What this project provides
- A complete MiniMind-O architecture: Thinker, an independent Talker, audio / vision projectors, the Mimi codebook interface and the MTP audio head.
- A full SFT pipeline that covers T2A, I2T and A2A data, supporting full-parameter training, audio-projector-only training, vision-projector-only training, and DDP multi-GPU training.
- Two training datasets, `mini` and `full`. `mini` is meant for quick onboarding and runs the pipeline in ~2 hours on a single RTX 3090; `full` matches the released weights and covers Chinese speech and image tasks.
- Multiple built-in voice prompts, unseen voice prompts and voice cloning from arbitrary reference audio, making voice-control experiments easy to reproduce.
- A complete inference and demo toolkit: CLI, Web UI, streaming playback, barge-in interruption and a phone-mode demo.
- Key modules are written from scratch in native PyTorch without high-level third-party wrappers, while remaining compatible with `transformers` tokenizers and native weight formats.
- A companion technical report covers architecture, training curves, CER / WER evaluation, voice-cloning similarity and cross-model comparisons. See the Tech Report badge at the top.
#### 🎉 Released models
| Model | Backbone params | Release |
|---|---|---|
| minimind-3o | ~0.1B | 2026.05.05 |
| minimind-3o-moe | ~0.3B-A0.1B | 2026.05.05 |
---
#### 👉 Update Log
| Speaker | Reference | Output | Avg |
| dylan |
https://github.com/user-attachments/assets/070ea3ab-0e8e-4aa0-84b5-af8d3c4e2725
|
https://github.com/user-attachments/assets/eb2da7ed-173c-47e9-9431-7bdb5a9b7385
| 0.6712 |
| eric |
https://github.com/user-attachments/assets/c74aa5dc-1edd-44c1-9546-6e57194c2f60
|
https://github.com/user-attachments/assets/f3fa8906-4e14-4610-a9d9-c16c915ca1b3
| 0.4430 |
| serena |
https://github.com/user-attachments/assets/0eeeac87-fa70-4025-b66e-1f0197f2b434
|
https://github.com/user-attachments/assets/c5901dca-4b2a-47f5-9b30-c89de54f908e
| 0.6600 |
| uncle_fu |
https://github.com/user-attachments/assets/fdd1bb28-6648-44bf-8bcb-4509e709e347
|
https://github.com/user-attachments/assets/95b480f1-f015-4712-8d7c-17db465f6584
| 0.6632 |
| vivian |
https://github.com/user-attachments/assets/f64731c4-67a3-4e18-b7d7-61bf44ef4bdd
|
https://github.com/user-attachments/assets/3f1cc9bb-16d2-4ce0-a473-40676cf4523e
| 0.5320 |
#### Unseen voices
"Unseen" means voices not seen during training, used to inspect zero-shot transfer of a new reference voice into generated speech.
| Speaker | Reference | Output | Avg |
| arthur |
https://github.com/user-attachments/assets/3430ecdb-6de8-4fb0-a6a7-ad82bdce01a1
|
https://github.com/user-attachments/assets/e598dbc2-ba28-4c38-b52d-6fa6c2349a5b
| 0.6479 |
| chelsie |
https://github.com/user-attachments/assets/f9166af6-3a98-42f3-9cf8-ad105eea87d6
|
https://github.com/user-attachments/assets/eccca693-4708-409a-88f7-85eb25f66fe6
| 0.5975 |
| cherry |
https://github.com/user-attachments/assets/e69b9cac-e12f-43ae-a9dc-7e1618ef3a43
|
https://github.com/user-attachments/assets/bb41cdef-cc92-48fa-a508-76a75d391565
| 0.5418 |
| ethan |
https://github.com/user-attachments/assets/9c992505-2046-483e-a7cf-50ec18a5e329
|
https://github.com/user-attachments/assets/98013c5e-f5b5-4e1a-bc0e-a0f0be5d3240
| 0.4323 |
| jennifer |
https://github.com/user-attachments/assets/924b035d-5c7c-45a5-a8f8-5dbdc18f71db
|
https://github.com/user-attachments/assets/853d1370-0065-4567-9a71-dc88a6a34d56
| 0.4052 |
| momo |
https://github.com/user-attachments/assets/7e97f524-da6d-4a2f-9095-e7f99262f4a5
|
https://github.com/user-attachments/assets/4c193c8f-8750-4424-acba-2bd13089a634
| 0.5968 |
| moon |
https://github.com/user-attachments/assets/527df88a-adc0-48d3-9a6a-827ca1ba7fb0
|
https://github.com/user-attachments/assets/3f533e26-1ad8-4ab3-baf1-21267734d3ee
| 0.5874 |
## Ⅲ Cross-model English T2A comparison
We selected 20 English questions, all constrained by `Answer briefly in one short sentence`. The intent is not to evaluate open-ended English ability, but to keep response lengths within a similar range. The three models then synthesize audio, which is uniformly transcribed by Qwen3-ASR; CER / WER between transcription and target text are used to compare Talker-side textual consistency.
| Length bucket | [Mini-Omni](https://huggingface.co/gpt-omni/mini-omni) CER/WER | [Mini-Omni2](https://huggingface.co/gpt-omni/mini-omni2) CER/WER | minimind-3o CER/WER |
|---|---|---|---|
| short (≤15w) | 0.0195 / 0.0384 (n=8) | 0.0503 / 0.0584 (n=14) | 0.0531 / 0.0417 (n=8) |
| mid (16–30w) | 0.0038 / 0.0052 (n=12) | 0.0062 / 0.0076 (n=6) | 0.1327 / 0.1420 (n=11) |
| long (31–60w) | — | — | 0.0431 / 0.0508 (n=1) |
For replies of ≤15 words, minimind-3o is already close to Mini-Omni2; the gap really opens up at 16–30 words. This length is no longer a simple phrase, and Talker must keep pronunciation, rhythm and surface form consistent in a complete short sentence simultaneously. This is also the regime where the current 0.1B Talker most easily exposes its instability.
## Ⅳ Cross-model vision-language comparison
[Mini-Omni](https://huggingface.co/gpt-omni/mini-omni) does not support a VL path, so the comparison is between [Mini-Omni2](https://huggingface.co/gpt-omni/mini-omni2) (0.5B) and minimind-3o (0.1B). On 9 synthetic images, both models generate English answers, which are then uniformly transcribed and used to compute CER / WER as a vision-to-speech consistency reference.
| Model | Params | Avg CER ↓ | Avg WER ↓ |
|---|---|---|---|
| [Mini-Omni2](https://huggingface.co/gpt-omni/mini-omni2) | 0.5B | 0.7609 | 0.9756 |
| minimind-3o | 0.1B | 0.8241 | 1.0293 |
These numbers should not be read as the absolute correctness of open-ended image description. Image captioning has many equivalent expressions, and synonym choices and word order both affect CER / WER, so high absolute values are expected. Under the same automatic pipeline, minimind-3o trails behind Mini-Omni2 but stays in the same order of magnitude, with roughly 1/5 the parameters.
## Ⅴ Qualitative samples

In speech-to-speech samples, the input is real speech, Thinker organizes the semantics, and Talker renders speech. Short replies are again the more stable regime; Chinese explanatory questions usually produce coherent answers, while English pronunciation and rhythm are relatively more stable.