> [!IMPORTANT]
> **License Notice**
> This codebase is released under the **Apache License**, and all model weights are released under the **CC-BY-NC-SA-4.0 License**. Please refer to [LICENSE](LICENSE) for more details.

> [!WARNING]
> **Legal Disclaimer**
> We assume no responsibility for any illegal use of this codebase. Please consult your local laws regarding the DMCA and other applicable regulations.
## FishAudio-S1
**True human-like Text-to-Speech and Voice Cloning**
FishAudio-S1 is an expressive text-to-speech (TTS) and voice cloning model developed by [Fish Audio](https://fish.audio/), designed to generate speech that sounds natural, realistic, and emotionally rich — not robotic, not flat, and not constrained to studio-style narration.
FishAudio-S1 focuses on how humans actually speak: with emotion, variation, pauses, and intent.
### Announcement 🎉
We are excited to announce that we have rebranded to **Fish Audio** — introducing a revolutionary new series of advanced Text-to-Speech models that builds upon the foundation of Fish-Speech.
We are proud to release **FishAudio-S1** (also known as OpenAudio S1) as the first model in this series, delivering significant improvements in quality, performance, and capabilities.
FishAudio-S1 comes in two versions: **FishAudio-S1** and **FishAudio-S1-mini**. **FishAudio-S1** is available on the [Fish Audio Playground](https://fish.audio), and **FishAudio-S1-mini** is available on [Hugging Face](https://huggingface.co/fishaudio/openaudio-s1-mini).
Visit the [Fish Audio website](https://fish.audio/) for the live playground and the technical report.
### Model Variants
| Model | Size | Availability | Description |
|------|------|-------------|-------------|
| FishAudio-S1 | 4B parameters | [fish.audio](https://fish.audio/) | Full-featured flagship model with maximum quality and stability |
| FishAudio-S1-mini | 0.5B parameters | [huggingface](https://huggingface.co/spaces/fishaudio/openaudio-s1-mini) | Open-source distilled model with core capabilities |
Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).
### Start Here
Here are the official Fish Speech documents; follow the instructions to get started.
- [Installation](https://speech.fish.audio/install/)
- [Finetune](https://speech.fish.audio/finetune/)
- [Inference](https://speech.fish.audio/inference/)
- [Samples](https://speech.fish.audio/samples/)
## Highlights
### **Excellent TTS quality**
We use the Seed-TTS Eval metrics to evaluate model performance. FishAudio S1 achieves **0.008 WER** and **0.004 CER** on English text, significantly better than previous models (English, automatic evaluation; transcription with OpenAI gpt-4o-transcribe, speaker distance with Revai/pyannote-wespeaker-voxceleb-resnet34-LM).
| Model | Word Error Rate (WER) | Character Error Rate (CER) | Speaker Distance |
|-------|----------------------|---------------------------|------------------|
| **S1** | **0.008** | **0.004** | **0.332** |
| **S1-mini** | **0.011** | **0.005** | **0.380** |
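For context, the WER and CER numbers above compare an ASR transcript of the generated audio against the input text. Below is a minimal sketch of that comparison using the open-source `jiwer` package; it only illustrates the metric and is not the Seed-TTS Eval pipeline itself (which uses gpt-4o-transcribe for transcription and a ResNet34 speaker model for speaker distance).

```python
# Illustrative only: how WER/CER-style scores are computed from transcripts.
# Requires `pip install jiwer`; not the official Seed-TTS Eval harness.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"   # text given to the TTS model
hypothesis = "the quick brown fox jumps over a lazy dog"    # ASR transcript of the generated audio

wer = jiwer.wer(reference, hypothesis)  # word-level errors (subs + ins + del) / reference words
cer = jiwer.cer(reference, hypothesis)  # same idea at the character level

print(f"WER: {wer:.3f}  CER: {cer:.3f}")
```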
### **Best Model in TTS-Arena2** 🏆
FishAudio S1 has achieved the **#1 ranking** on [TTS-Arena2](https://arena.speechcolab.org/), a benchmark for text-to-speech evaluation.
### True Human-Like Speech
FishAudio-S1 generates speech that sounds natural and conversational rather than robotic or overly polished. The model captures subtle variations in timing, emphasis, and prosody, avoiding the “studio recording” effect common in traditional TTS systems.
### **Emotion Control and Expressiveness**
FishAudio S1 is the first TTS model to support **open-domain fine-grained emotion control** through explicit emotion and tone markers embedded in the input text. You can precisely steer how a voice sounds:
- **Basic emotions**:
```
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
```
- **Advanced emotions**:
```
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
```
- **Tone markers**:
```
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
```
- **Special audio effects**:
```
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
```
You can also use onomatopoeia such as `Ha,ha,ha` directly in the text to shape delivery; many more cases are waiting to be explored.
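As a concrete illustration, the markers above are written inline, directly inside the text sent to the model. The snippet below only builds such a string; `synthesize` is a hypothetical placeholder for whichever inference entry point you use (playground, WebUI, or API), not a function from this repository.

```python
# Emotion and tone markers are plain text wrapped in parentheses, placed
# where the change in delivery should happen. `synthesize` below is a
# placeholder for your actual inference call (WebUI, API server, or SDK).
text = (
    "(excited) I can't believe we actually shipped it! "
    "(whispering) Don't tell anyone yet... "
    "(laughing) Ha,ha,ha, okay, maybe tell a few people."
)

# audio = synthesize(text)  # hypothetical call; see the Inference docs for the real entry point
```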
### Multilingual Support
FishAudio-S1 supports high-quality multilingual text-to-speech without requiring phonemes or language-specific preprocessing.
**Languages supporting emotion markers include:**
English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, and Portuguese.
The list is constantly expanding; check [Fish Audio](https://fish.audio/) for the latest releases.
### Rapid Voice Cloning
FishAudio-S1 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
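For a rough idea of the cloning workflow, the sketch below uses the `fish-audio-sdk` Python package against the hosted API. The class names (`Session`, `TTSRequest`, `ReferenceAudio`) and streaming behavior are assumptions here, so verify them against the [official docs](https://docs.fish.audio/) before use; local inference follows the same pattern, with the reference clip passed to the inference scripts.

```python
# Sketch of zero-shot voice cloning through the hosted Fish Audio API.
# Assumes `pip install fish-audio-sdk`; verify names/arguments against the SDK docs.
from fish_audio_sdk import Session, TTSRequest, ReferenceAudio

session = Session("YOUR_API_KEY")  # placeholder credential

# A 10-30 second clean reference clip plus its transcript.
with open("reference.wav", "rb") as f:
    reference = ReferenceAudio(audio=f.read(), text="Transcript of the reference clip.")

request = TTSRequest(
    text="(relaxed) Hello! This should sound like the reference speaker.",
    references=[reference],
)

# The SDK streams audio chunks; write them out as they arrive.
with open("cloned.mp3", "wb") as out:
    for chunk in session.tts(request):
        out.write(chunk)
```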
## **Features**
1. **Zero-shot & Few-shot TTS:** Input a 10 to 30-second vocal sample to generate high-quality TTS output. **For detailed guidelines, see [Voice Cloning Best Practices](https://docs.fish.audio/resources/best-practices/voice-cloning).**
2. **Multilingual & Cross-lingual Support:** Simply copy and paste multilingual text into the input box—no need to worry about the language. Currently supports English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish.
3. **No Phoneme Dependency:** The model has strong generalization capabilities and does not rely on phonemes for TTS. It can handle text in any language script.
4. **Highly Accurate:** Achieves a low CER (Character Error Rate) of around 0.4% and WER (Word Error Rate) of around 0.8% for Seed-TTS Eval.
5. **Fast:** Accelerated by `torch.compile`, the real-time factor is approximately 1:7 on an Nvidia RTX 4090 GPU (see the measurement sketch after this list).
6. **WebUI Inference:** Features an easy-to-use, Gradio-based web UI compatible with Chrome, Firefox, Edge, and other browsers.
7. **Deploy-Friendly:** Easily set up an inference server with native support for Linux and Windows (macOS support coming soon), minimizing performance loss.
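To make the real-time factor claim in item 5 concrete, the sketch below shows one common way to measure it: wall-clock synthesis time divided by the duration of the generated audio. `synthesize_to_file` is a hypothetical stand-in for your actual inference call.

```python
# Measure the real-time factor (RTF) of a synthesis run: generation time / audio duration.
# A 1:7 factor means roughly 7 seconds of speech are produced per second of compute.
# `synthesize_to_file` is a placeholder, not a function from this repository.
import time
import soundfile as sf

def measure_rtf(synthesize_to_file, text: str, out_path: str = "out.wav") -> float:
    start = time.perf_counter()
    synthesize_to_file(text, out_path)      # your actual inference call goes here
    elapsed = time.perf_counter() - start   # wall-clock generation time in seconds

    duration = sf.info(out_path).duration   # length of the generated audio in seconds
    return elapsed / duration               # < 1.0 means faster than real time

# Example: rtf = measure_rtf(my_tts, "Hello world!")  # ~0.14 corresponds to a 1:7 factor
```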
## **Media & Demos**