> [!IMPORTANT]
> **License Notice**
> This codebase is released under **Apache License** and all model weights are released under **CC-BY-NC-SA-4.0 License**. Please refer to [LICENSE](LICENSE) for more details.
> [!WARNING]
> **Legal Disclaimer**
> We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.
## Start Here
Here are the official documents for Fish Speech, follow the instructions to get started easily.
- [Installation](https://speech.fish.audio/install/)
- [Finetune](https://speech.fish.audio/finetune/)
- [Inference](https://speech.fish.audio/inference/)
- [Samples](https://speech.fish.audio/examples)
## 🎉 Announcement
We are excited to announce that we have rebranded to **OpenAudio** — introducing a revolutionary new series of advanced Text-to-Speech models that builds upon the foundation of Fish-Speech.
We are proud to release **OpenAudio-S1** as the first model in this series, delivering significant improvements in quality, performance, and capabilities.
OpenAudio-S1 comes in two versions: **OpenAudio-S1** and **OpenAudio-S1-mini**. Both models are now available on [Fish Audio Playground](https://fish.audio) (for **OpenAudio-S1**) and [Hugging Face](https://huggingface.co/fishaudio/openaudio-s1-mini) (for **OpenAudio-S1-mini**).
Visit the [OpenAudio website](https://openaudio.com/blogs/s1) for blog & tech report.
## Highlights ✨
### **Excellent TTS quality**
We use Seed TTS Eval Metrics to evaluate the model performance, and the results show that OpenAudio S1 achieves **0.008 WER** and **0.004 CER** on English text, which is significantly better than previous models. (English, auto eval, based on OpenAI gpt-4o-transcribe, speaker distance using Revai/pyannote-wespeaker-voxceleb-resnet34-LM)
| Model | Word Error Rate (WER) | Character Error Rate (CER) | Speaker Distance |
|-------|----------------------|---------------------------|------------------|
| **S1** | **0.008** | **0.004** | **0.332** |
| **S1-mini** | **0.011** | **0.005** | **0.380** |
### **Best Model in TTS-Arena2** 🏆
OpenAudio S1 has achieved the **#1 ranking** on [TTS-Arena2](https://arena.speechcolab.org/), the benchmark for text-to-speech evaluation:
### **Speech Control**
OpenAudio S1 **supports a variety of emotional, tone, and special markers** to enhance speech synthesis:
- **Basic emotions**:
```
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
```
- **Advanced emotions**:
```
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
```
- **Tone markers**:
```
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
```
- **Special audio effects**:
```
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
```
You can also use Ha,ha,ha to control, there's many other cases waiting to be explored by yourself.
(Support for English, Chinese and Japanese now, and more languages is coming soon!)
### **Two Type of Models**
| Model | Size | Availability | Features |
|-------|------|--------------|----------|
| **S1** | 4B parameters | Avaliable on [fish.audio](https://fish.audio/) | Full-featured flagship model |
| **S1-mini** | 0.5B parameters | Avaliable on huggingface [hf space](https://huggingface.co/spaces/fishaudio/openaudio-s1-mini) | Distilled version with core capabilities |
Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).
## **Features**
1. **Zero-shot & Few-shot TTS:** Input a 10 to 30-second vocal sample to generate high-quality TTS output. **For detailed guidelines, see [Voice Cloning Best Practices](https://docs.fish.audio/resources/best-practices/voice-cloning).**
2. **Multilingual & Cross-lingual Support:** Simply copy and paste multilingual text into the input box—no need to worry about the language. Currently supports English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish.
3. **No Phoneme Dependency:** The model has strong generalization capabilities and does not rely on phonemes for TTS. It can handle text in any language script.
4. **Highly Accurate:** Achieves a low CER (Character Error Rate) of around 0.4% and WER (Word Error Rate) of around 0.8% for Seed-TTS Eval.
5. **Fast:** Accelerated by torch compile, the real-time factor is approximately 1:7 on an Nvidia RTX 4090 GPU.
6. **WebUI Inference:** Features an easy-to-use, Gradio-based web UI compatible with Chrome, Firefox, Edge, and other browsers.
7. **Deploy-Friendly:** Easily set up an inference server with native support for Linux and Windows (macOS support coming soon), minimizing performance loss.
## **Media & Demos**