# Architecture: Global Transformer + Delay-Pattern (MossTTSDelay)
This document details the **MossTTSDelay** architecture, the production-grade variant of the MOSS-TTS family. It employs a **Single Transformer** backbone with **Multi-Head Parallel Prediction** and a **Delay-Pattern** scheduling mechanism to achieve high-speed, stable, and long-form speech synthesis. The architecture diagram is shown in the figure.
---
## 1. Overview: Parallel Heads + Delay Pattern
Unlike the **MossTTSLocal** architecture which uses a hierarchical "Temporal + Depth" approach, **MossTTSDelay** integrates all modeling into a single large-scale Transformer. It achieves efficient multi-codebook modeling by shifting the RVQ layers in the time domain, allowing the model to predict all codebook layers for a given step simultaneously through multiple linear heads.
### Key Components
* **Unified Transformer Backbone:** A large-scale language model (based on the **Qwen-8B** scale) that handles text encoding, prosody modeling, and audio token prediction in a single forward pass.
* **Multi-Head Output Layer:** The backbone is equipped with **$1 + N_q$** (where $N_q=32$) prediction heads. One head manages the primary sequence logic, while the other 32 heads parallelly predict the RVQ codebook layers.
* **Delay-Pattern Scheduling:** A specialized data formatting technique that introduces a 1-step offset between successive RVQ layers. This enables causal dependency modeling across codebook depths without requiring an additional "Depth Transformer."
---
## 2. Technical Specifications
| Feature | Specification |
| :--- | :--- |
| **Backbone Model** | Initialized from **Qwen-8B** scale |
| **Prediction Heads** | **33 LM Heads** (1 Main + 32 RVQ Heads) |
| **Audio Tokenizer** | **Cat** (Causal Audio Tokenizer) |
| **Sampling Rate** | 24,000 Hz |
| **Frame Rate** | 12.5 Hz (1s ≈ 12.5 tokens) |
| **Codebooks** | 32 RVQ layers (10-bit each) |
| **Generation Mode** | Parallel Autoregressive (Delay-Pattern) |
| **Primary Advantage** | Inference speed & Long-context stability |
---
## 3. Core Mechanism: Multi-Head Parallel Prediction
The defining characteristic of MossTTSDelay is its **computational efficiency**. By attaching 32 independent linear heads to the final hidden state of the Transformer backbone, the model can generate an entire frame's worth of multi-layer RVQ tokens in a **single forward step**.
### Why this is faster than MossTTSLocal:
* **No Nested Loops:** While the Local architecture requires a secondary "Local Transformer" to iterate through each RVQ layer within one time step, MossTTSDelay computes all layers in parallel.
* **Direct Projection:** The relationship between codebook layers is captured by the backbone's internal representations and the delay-pattern, removing the latency overhead of a dedicated depth-modeling module.
---
## 4. Prediction Topology: Delay-Pattern
To maintain the hierarchical dependency of RVQ (where Layer $k$ should ideally "see" the information from Layer $k-1$), MossTTSDelay uses **Delay-Pattern Scheduling**.
**The Pattern:**
At each training or inference step $t$, the input sequence is structured such that:
* Head 1 predicts Layer 1 of Frame $t$.
* Head 2 predicts Layer 2 of Frame $t-1$.
* Head 3 predicts Layer 3 of Frame $t-2$.
* ... and so on.
**Dependency Modeling:**
Because the Transformer is causal, when the model predicts tokens for "Step $t$", it has already seen the tokens from "Step $t-1$" in its context. Due to the 1-step shift, the information for Layer $k-1$ (at Step $t$) is already present in the history when the model predicts Layer $k$ (at Step $t+1$). This "diagonal" dependency effectively models the coarse-to-fine structure of the audio tokenizer.
---
## 5. Evaluation & Performance
According to the `moss_tts_model_card.md`, the **MossTTSDelay-8B** is the recommended model for production and long-form stability:
| Metric | Result (Seed-TTS-Eval) |
| :--- | :--- |
| **EN SIM (Speaker Similarity)** | **0.7146** |
| **ZH SIM (Speaker Similarity)** | **0.7705** |
| **EN WER (Word Error Rate)** | **1.79%** |
| **ZH CER (Char Error Rate)** | **1.32%** |
**Conclusion:** MossTTSDelay offers superior long-context stability and faster inference speeds compared to the Local variant. Its 8B parameter scale provides the capacity needed for complex prosody and ultra-long (up to 1 hour) speech generation.
---
## 6. Architecture Comparison
| Aspect | MossTTSDelay (Architecture A) | MossTTSLocal (Architecture B) |
| :--- | :--- | :--- |
| **Structure** | Single Transformer (8B) | Temporal + Depth Transformers (1.7B) |
| **Scheduling** | **Delay-Pattern (Diagonal Shift)** | Per-step Synchronous Blocks |
| **Prediction Heads** | **33 Parallel Heads** | Single Latent Head + Local Module |
| **Inference Speed** | **High** (Parallel RVQ prediction) | Moderate (Sequential RVQ prediction) |
| **Stability** | Excellent for long-form (1h+) | Optimized for short-segment metrics |
| **Best For** | Production, Scalable Apps, Narration | Research, Quality Benchmarks |
---