You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

You agree to not use the model to generate contents that violate DMCA or local laws.

FishAudio S1

FishAudio S1 is a leading text-to-speech (TTS) model trained on more than 2 million hours of audio data in multiple languages.

Supported languages:

English (en)
Chinese (zh)
Japanese (ja)
German (de)
French (fr)
Spanish (es)
Korean (ko)
Arabic (ar)
Russian (ru)
Dutch (nl)
Italian (it)
Polish (pl)
Portuguese (pt)

Please refer to Fish Speech Github for more info. Demo available at Fish Audio Playground. Visit the Fish Audio website for blog & tech report.

Emotion and Tone Support

FishAudio S1 supports a variety of emotional, tone, and special markers to enhance speech synthesis:

1. Emotional markers: (angry) (sad) (disdainful) (excited) (surprised) (satisfied) (unhappy) (anxious) (hysterical) (delighted) (scared) (worried) (indifferent) (upset) (impatient) (nervous) (guilty) (scornful) (frustrated) (depressed) (panicked) (furious) (empathetic) (embarrassed) (reluctant) (disgusted) (keen) (moved) (proud) (relaxed) (grateful) (confident) (interested) (curious) (confused) (joyful) (disapproving) (negative) (denying) (astonished) (serious) (sarcastic) (conciliative) (comforting) (sincere) (sneering) (hesitating) (yielding) (painful) (awkward) (amused)

2. Tone markers: (in a hurry tone) (shouting) (screaming) (whispering) (soft tone)

3. Special markers: (laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting) (groaning) (crowd laughing) (background laughter) (audience laughing)

Special markers with corresponding onomatopoeia:

Laughing: Ha,ha,ha
Chuckling: Hmm,hmm

Model Variants and Performance

FishAudio S1 includes the following models:

S1 (4B, proprietary): The full-sized model.
S1-mini (0.5B): A distilled version of S1.

Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).

Seed TTS Eval Metrics (English, auto eval, based on OpenAI gpt-4o-transcribe, speaker distance using Revai/pyannote-wespeaker-voxceleb-resnet34-LM):

S1:
- WER (Word Error Rate): 0.008
- CER (Character Error Rate): 0.004
- Distance: 0.332
S1-mini:
- WER (Word Error Rate): 0.011
- CER (Character Error Rate): 0.005
- Distance: 0.380

License

This model is permissively licensed under the CC-BY-NC-SA-4.0 license.

Downloads last month: 5,121

Model tree for fishaudio/s1-mini

Finetunes

2 models

Quantizations

1 model

Spaces using fishaudio/s1-mini 29

Collection including fishaudio/s1-mini

Fish Audio S1

Collection

2 items • Updated 13 days ago • 1