---
license: apache-2.0
pipeline_tag: text-to-speech
library_name: outetts
language:
- en
- zh
- nl
- fr
- ka
- de
- hu
- it
- ja
- ko
- lv
- pl
- ru
- es
---
<div class="p-4 bg-gray-50 dark:bg-gray-800 rounded-lg shadow-sm mb-12">
  <div class="text-center mb-4">
    <h2 class="text-xl font-light text-gray-900 dark:text-white tracking-tight mt-0 mb-0">Oute A I</h2>
    <div class="flex justify-center gap-6 mt-4">
      <a href="https://www.outeai.com/" target="_blank" class="flex items-center gap-1 text-gray-700 dark:text-gray-300 text-m font-medium hover:text-gray-900 dark:hover:text-white transition-colors underline">
        <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
          <circle cx="12" cy="12" r="10"></circle>
          <path d="M2 12h20M12 2a15.3 15.3 0 0 1 4 10 15.3 15.3 0 0 1-4 10 15.3 15.3 0 0 1-4-10 15.3 15.3 0 0 1 4-10z"></path>
        </svg>
        outeai.com
      </a>
      <a href="https://discord.gg/vyBM87kAmf" target="_blank" class="flex items-center gap-1 text-gray-700 dark:text-gray-300 text-m font-medium hover:text-gray-900 dark:hover:text-white transition-colors underline">
        <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
          <path d="M21 11.5a8.38 8.38 0 0 1-.9 3.8 8.5 8.5 0 0 1-7.6 4.7 8.38 8.38 0 0 1-3.8-.9L3 21l1.9-5.7a8.38 8.38 0 0 1-.9-3.8 8.5 8.5 0 0 1 4.7-7.6 8.38 8.38 0 0 1 3.8-.9h.5a8.48 8.48 0 0 1 8 8v.5z"></path>
        </svg>
        Discord
      </a>
      <a href="https://x.com/OuteAI" target="_blank" class="flex items-center gap-1 text-gray-700 dark:text-gray-300 text-m font-medium hover:text-gray-900 dark:hover:text-white transition-colors underline">
        <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
          <path d="M23 3a10.9 10.9 0 0 1-3.14 1.53 4.48 4.48 0 0 0-7.86 3v1A10.66 10.66 0 0 1 3 4s-4 9 5 13a11.64 11.64 0 0 1-7 2c9 5 20 0 20-11.5a4.5 4.5 0 0 0-.08-.83A7.72 7.72 0 0 0 23 3z"></path>
        </svg>
        @OuteAI
      </a>
    </div>
  </div>

  <div class="grid grid-cols-3 sm:grid-cols-3 gap-2">
    <a href="https://huggingface.co/OuteAI/OuteTTS-1.0-0.6B" target="_blank" class="bg-white dark:bg-gray-700 text-gray-800 dark:text-gray-100 text-sm font-medium py-2 px-3 rounded-md text-center hover:bg-gray-100 dark:hover:bg-gray-600 hover:border-gray-300 dark:hover:border-gray-500 border border-transparent transition-all">
      OuteTTS 1.0 0.6B
    </a>
    <a href="https://huggingface.co/OuteAI/OuteTTS-1.0-0.6B-FP8" target="_blank" class="bg-white dark:bg-gray-700 text-gray-800 dark:text-gray-100 text-sm font-medium py-2 px-3 rounded-md text-center hover:bg-gray-100 dark:hover:bg-gray-600 hover:border-gray-300 dark:hover:border-gray-500 border border-transparent transition-all">
      OuteTTS 1.0 0.6B FP8
    </a>
    <a href="https://huggingface.co/OuteAI/OuteTTS-1.0-0.6B-GGUF" target="_blank" class="bg-white dark:bg-gray-700 text-gray-800 dark:text-gray-100 text-sm font-medium py-2 px-3 rounded-md text-center hover:bg-gray-100 dark:hover:bg-gray-600 hover:border-gray-300 dark:hover:border-gray-500 border border-transparent transition-all">
      OuteTTS 1.0 0.6B GGUF
    </a>
    <a href="https://huggingface.co/OuteAI/OuteTTS-1.0-0.6B-EXL2-8bpw" target="_blank" class="bg-white dark:bg-gray-700 text-gray-800 dark:text-gray-100 text-sm font-medium py-2 px-3 rounded-md text-center hover:bg-gray-100 dark:hover:bg-gray-600 hover:border-gray-300 dark:hover:border-gray-500 border border-transparent transition-all">
      OuteTTS 1.0 0.6B EXL2 8bpw
    </a>
    <a href="https://github.com/edwko/OuteTTS" target="_blank" class="bg-white dark:bg-gray-700 text-gray-800 dark:text-gray-100 text-sm font-medium py-2 px-3 rounded-md text-center hover:bg-gray-100 dark:hover:bg-gray-600 hover:border-gray-300 dark:hover:border-gray-500 border border-transparent transition-all">
      GitHub Library
    </a>
  </div>
</div>

> [!IMPORTANT]
> **Important Sampling Considerations**
>
> When using OuteTTS version 1.0, it is crucial to use the settings specified in the [Sampling Configuration](#sampling-configuration) section.
> The **repetition penalty implementation** is particularly important: this model requires the penalty to be applied over a **64-token recent window**,
> rather than across the entire context window. Penalizing the entire context will cause the model to produce **broken or low-quality output**.
>
> To address this limitation, all necessary samplers and patches for all backends are set up automatically in the **outetts** library.
> If you are using a custom implementation, make sure you implement these requirements correctly.

# OuteTTS Version 1.0

This update brings significant improvements in speech synthesis and voice cloning, delivering a more powerful, accurate, and user-friendly experience in a compact size.

## OuteTTS Python Package v0.4.2

The new version adds **batched inference** generation in the latest OuteTTS release.

### ⚡ **Batched RTF Benchmarks**
Tested with an **NVIDIA L40S GPU**



## Quick Start Guide

Getting started with **OuteTTS** is simple:

### Installation

🔗 [Installation instructions](https://github.com/edwko/OuteTTS?tab=readme-ov-file#installation)

### Basic Setup
```python
from outetts import Interface, ModelConfig, GenerationConfig, Backend, Models

# Initialize the interface
interface = Interface(
    ModelConfig.auto_config(
        model=Models.VERSION_1_0_SIZE_0_6B,
        backend=Backend.HF,
    )
)

# Load the default English speaker profile
speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")

# Or create your own speaker (do this once):
# speaker = interface.create_speaker("path/to/audio.wav")
# interface.save_speaker(speaker, "speaker.json")

# Load your speaker from the saved file:
# speaker = interface.load_speaker("speaker.json")

# Generate speech and save it to a file
output = interface.generate(
    GenerationConfig(
        text="Hello, how are you doing?",
        speaker=speaker,
    )
)
output.save("output.wav")
```

### ⚡ Batch Setup
```python
from outetts import Interface, ModelConfig, GenerationConfig, Backend, GenerationType

if __name__ == "__main__":
    # Initialize the interface with a batch-capable backend
    interface = Interface(
        ModelConfig(
            model_path="OuteAI/OuteTTS-1.0-0.6B-FP8",
            tokenizer_path="OuteAI/OuteTTS-1.0-0.6B",
            backend=Backend.VLLM,
            # For EXL2, use backend=Backend.EXL2ASYNC and set exl2_cache_seq_multiply
            # to the same value as max_batch_size in GenerationConfig.
            # For LLAMACPP_ASYNC_SERVER, use backend=Backend.LLAMACPP_ASYNC_SERVER
            # and provide server_host in GenerationConfig.
        )
    )

    # Load your speaker profile
    speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")  # Or load/create a custom speaker

    # Generate speech using the BATCH generation type
    # Note: for EXL2ASYNC, VLLM, and LLAMACPP_ASYNC_SERVER, BATCH is selected automatically.
    output = interface.generate(
        GenerationConfig(
            text="This is a longer text that will be automatically split into chunks and processed in batches.",
            speaker=speaker,
            generation_type=GenerationType.BATCH,
            max_batch_size=32,        # Adjust based on your GPU memory and server capacity
            dac_decoding_chunk=2048,  # Adjust chunk size for DAC decoding
            # If using LLAMACPP_ASYNC_SERVER, add:
            # server_host="http://localhost:8000"  # Replace with your server address
        )
    )

    # Save to file
    output.save("output_batch.wav")
```

### More Configuration Options
For advanced settings and customization, visit the official repository:

[](https://github.com/edwko/OuteTTS/blob/main/docs/interface_usage.md)

## Multilingual Capabilities

- **Trained Languages:** English, Chinese, Dutch, French, Georgian, German, Hungarian, Italian, Japanese, Korean, Latvian, Polish, Russian, Spanish

- **Beyond Supported Languages:** The model can generate speech in untrained languages with varying success. Experiment with unlisted languages, though results may not be optimal.

## Usage Recommendations

### Speaker Reference
The model is designed to be used with a speaker reference. Without one, it generates random vocal characteristics, often leading to lower-quality output.
The model inherits the referenced speaker's emotion, style, and accent.
When generating speech in other languages with the same speaker, you may find that the model retains the original accent.

### Multilingual Application
It is recommended to create a speaker profile in the language you intend to use. This helps achieve the best results in that specific language, including tone, accent, and linguistic features.

While the model supports cross-lingual speech, it still relies on the reference speaker. If the speaker has a distinct accent, such as British English, other languages may carry that accent as well.

### Optimal Audio Length
- **Best Performance:** Generate audio around **42 seconds** in a single run (approximately 8,192 tokens). It is recommended not to approach the limits of this window when generating; the best results are usually achieved with up to about 7,000 tokens.
- **Context Reduction with Speaker Reference:** If the speaker reference is 10 seconds long, the effective context is reduced to approximately 32 seconds.
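The arithmetic behind these numbers can be sketched directly. This assumes the figures above (8,192 tokens ≈ 42 seconds, so roughly 195 tokens per second of audio); the function name is illustrative, not part of the outetts API:

```python
# Rough token-budget arithmetic for planning OuteTTS generations.
# Derived from this section's figures: an 8,192-token context covers ~42 s of audio.
CONTEXT_TOKENS = 8192
MAX_AUDIO_SECONDS = 42
TOKENS_PER_SECOND = CONTEXT_TOKENS / MAX_AUDIO_SECONDS  # ~195 tokens per second

def remaining_audio_seconds(speaker_ref_seconds: float) -> float:
    """Approximate audio budget left after the speaker reference is encoded."""
    used_tokens = speaker_ref_seconds * TOKENS_PER_SECOND
    return (CONTEXT_TOKENS - used_tokens) / TOKENS_PER_SECOND

# A 10-second reference leaves roughly 32 seconds of generation budget:
print(round(remaining_audio_seconds(10)))  # 32
```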

### Temperature Setting Recommendations
Testing shows that a temperature of **0.4** is an ideal starting point for accuracy (with the sampling settings below). However, some voice references may benefit from higher temperatures for enhanced expressiveness, or slightly lower temperatures for more precise voice replication.

### Verifying Speaker Encoding
If the cloned voice quality is subpar, check the encoded speaker sample:

```python
interface.decode_and_save_speaker(speaker=your_speaker, path="speaker.wav")
```

The DAC audio reconstruction model is lossy, and samples with clipping, excessive loudness, or unusual vocal features may introduce encoding issues that impact output quality.
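A quick pre-flight check on a reference sample can catch clipping and loudness problems before they reach the encoder. This is a pure-Python sketch over a mono float waveform in [-1, 1]; the function name and threshold are illustrative, not part of the outetts API:

```python
def reference_quality_report(samples, clip_threshold=0.99):
    """Report peak, RMS, and clipped fraction for a mono float waveform in [-1, 1]."""
    peak = max(abs(s) for s in samples)
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    clipped = sum(1 for s in samples if abs(s) >= clip_threshold)
    return {"peak": peak, "rms": rms, "clipped_fraction": clipped / len(samples)}

# One of these four samples sits at the clipping threshold:
report = reference_quality_report([0.1, -0.2, 0.995, 0.3])
```

A high `clipped_fraction` or a peak near 1.0 suggests attenuating or re-recording the reference before creating a speaker from it.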

### Sampling Configuration
For optimal results with this TTS model, use the following sampling settings:

| Parameter            | Value  |
|----------------------|--------|
| Temperature          | 0.4    |
| Repetition Penalty   | 1.1    |
| **Repetition Range** | **64** |
| Top-k                | 40     |
| Top-p                | 0.9    |
| Min-p                | 0.05   |

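The **Repetition Range = 64** row is the windowed penalty that the note at the top of this card warns about. A minimal pure-Python sketch of the idea (the outetts library ships proper per-backend implementations; this function name and the list-based logits are illustrative only):

```python
def windowed_repetition_penalty(logits, generated_ids, penalty=1.1, window=64):
    """Apply a repetition penalty only to tokens seen in the last `window` ids.

    Penalizing the entire context instead of this recent window is what
    degrades OuteTTS output quality.
    """
    recent = set(generated_ids[-window:])
    out = list(logits)
    for tok in recent:
        if out[tok] > 0:
            out[tok] /= penalty  # make recently used tokens less likely
        else:
            out[tok] *= penalty
    return out

# Token 2 was last used more than 64 steps ago, so it is left untouched:
history = [2] + [0] * 100 + [1]
adjusted = windowed_repetition_penalty([2.0, -1.0, 0.5], history)
```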
## 📊 Model Specifications

| **Model**                | **Training Data**  | **Context Length** | **Supported Languages** |
|--------------------------|--------------------|--------------------|-------------------------|
| **Llama-OuteTTS-1.0-1B** | 60k hours of audio | 8,192 tokens       | 23+ languages           |
| **OuteTTS-1.0-0.6B**     | 20k hours of audio | 8,192 tokens       | 14+ languages           |

## Acknowledgments

- Audio encoding and decoding utilize [ibm-research/DAC.speech.v1.0](https://huggingface.co/ibm-research/DAC.speech.v1.0)
- OuteTTS is built with [Qwen3 0.6B](https://huggingface.co/Qwen/Qwen3-0.6B-Base) as the base model, with continued pre-training and fine-tuning.
- Datasets used: [Multilingual LibriSpeech (MLS)](https://www.openslr.org/94/) ([CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)), [Common Voice Corpus](https://commonvoice.mozilla.org/en/datasets) ([CC-0](https://creativecommons.org/public-domain/cc0/))

---

### Ethical Use Guidelines

1. **Intended Purpose:** This model is intended for legitimate applications that enhance accessibility, creativity, and communication.

2. **Prohibited Uses:**
   * Impersonation of individuals without their explicit, informed consent.
   * Creation of deliberately misleading, false, or deceptive content (e.g., "deepfakes" for malicious purposes).
   * Generation of harmful, hateful, harassing, or defamatory material.
   * Voice cloning of any individual without their explicit prior permission.
   * Any uses that violate applicable local, national, or international laws, regulations, or copyrights.

3. **Responsibility:** Users are responsible for the content they generate and how it is used. We encourage thoughtful consideration of the potential impact of synthetic media.