After 4000 votes F5 TTS fell near the bottom of the leaderboard, so I extracted some samples from Emilia. Let us see if that changes anything.
stabilityai/stable-point-aware-3d
Here's how it looks, with TRELLIS for comparison.
If your data exceeds quantity & quality thresholds and is approved into the next hexgrad/Kokoro-82M training mix, and you permissively DM me the data under an effective Apache license, then I will DM back the corresponding voicepacks for YOUR data if/when the next Apache-licensed Kokoro base model drops.
What does this mean? If you've been calling closed-source TTS or audio API endpoints to:
- Build voice agents
- Make long-form audio, like audiobooks or podcasts
- Handle customer support, etc.
Then YOU can contribute to the training mix and get useful artifacts in return. ❤️
More details at hexgrad/Kokoro-82M#21
The original Arena's threshold is 700 votes, but I am sure Kokoro will hold its position. The voice quality actually sounds close to ElevenLabs.
But StyleTTS is usually not very emotional, so it will fail where Edge TTS does: phrases where the voice has to be sad or angry. Parler Expresso, for example, was overly jolly.
self.brag():
Kokoro finally got 300 votes in Pendrokar/TTS-Spaces-Arena after @Pendrokar was kind enough to add it 3 weeks ago. Discounting the small sample size of votes, I think it is safe to say that hexgrad/Kokoro-TTS is currently a top 3 model among the contenders in that Arena. This is notable because:
- At 82M params, Kokoro is one of the smaller models in the Arena
- MeloTTS has 52M params
- F5 TTS has 330M params
- XTTSv2 has 467M params
It's expressive, punches way above its weight class, and supports voice cloning. Go check it out!
(Unmute the audio sample below after hitting play)
True, a sample from the original dataset would probably be best. My attempt to fetch one from the Emilia dataset was unsuccessful, as the HF dataset viewer can only show the German samples. Emilia's homepage gives an ASMR-y example prompt.
True about the narration-style sample, but that still did not stop XTTS from surpassing F5. Both use the same sample.
The voice sample used is the same as XTTS's. F5 has so far been unstable, being unemotional/monotone/depressed and mispronouncing words (_awestruck_).
If you have suggestions please give feedback in the following thread:
mrfakename/E2-F5-TTS#32
Pendrokar/TTS-Spaces-Arena
Svngoku/maskgct-audio-lab
hexgrad/Kokoro-TTS
I chose @Svngoku's forked HF Space over amphion's due to the overly high ZeroGPU duration demand on the latter. 300s!
amphion/maskgct
Had to remove @mrfakename's MetaVoice-1B Space from the available models, as that Space has been down for quite some time.
mrfakename/MetaVoice-1B-v0.1
I'm close to syncing the code with the original Arena's code structure. Then I'd like to use ASR to validate the generated samples and build synthetic public datasets from them, and then make the Arena multilingual, which will surely attract quite a crowd!
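The ASR validation idea could be sketched roughly like this: transcribe each generated sample, then keep it only if the transcript closely matches the prompt text. The `validate_sample` helper, the word-error-rate check, and the 0.2 threshold are my own illustrative assumptions, not details from the post.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # DP row over hypothesis words
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)

def validate_sample(prompt_text: str, transcript: str, max_wer: float = 0.2) -> bool:
    """Keep a generated TTS sample only if the ASR transcript agrees
    closely enough with the prompt it was synthesized from."""
    return wer(prompt_text, transcript) <= max_wer
```

Samples passing the check could then be published as a synthetic dataset; the threshold would need tuning per language once the Arena goes multilingual.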
Moonshine is a fast, efficient, & accurate ASR model released by Useful Sensors. It's designed for on-device inference and licensed under the MIT license!
HF Space (unofficial demo): mrfakename/Moonshine
GitHub repo for Moonshine: https://github.com/usefulsensors/moonshine
TTS-AGI/TTS-Arena's button for downloading the DB data was available for a short while. The reason for removal must have been the unreviewed user-submitted entries within the spokentext table. I've cleaned it up:
https://huggingface.co/datasets/Pendrokar/TTS_Arena_DB
Link to Spaces fork: https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
Link to Original Arena: https://huggingface.co/spaces/TTS-AGI/TTS-Arena
Dual-licensed under MIT/Apache 2.0.
Model Weights: mrfakename/styletts2-detector
Spaces: mrfakename/styletts2-detector