[Status] Jan 21 Update
Jan 21: A new Kokoro model should land in about a week, towards the end of January. Expect more voices and more languages, exact details TBD, although support for some languages may be thin due to low data volume or G2P issues. Initial results look promising, but we will see if loss continues to decrease.
Setback: Unfortunately, Korean had to be withdrawn from the training set due to G2P issues. Because Kokoro shares tokens (phonemes) across languages, G2P accuracy is paramount. There are ways around this—improving the G2P, or dedicating a few dozen tokens to just Korean—but none feasible in time for the next model.
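To illustrate why shared tokens make G2P accuracy paramount, here is a rough sketch (it assumes the `phonemizer` package with an espeak-ng backend as a stand-in G2P; the tiny shared vocabulary is made up for illustration, and this is not necessarily Kokoro's exact pipeline):

```python
# Rough illustration of shared phoneme tokens across languages.
# Assumes `pip install phonemizer` plus an espeak-ng install; this is a
# stand-in G2P, not necessarily the exact pipeline Kokoro uses.
from phonemizer import phonemize

en = phonemize("hello world", language="en-us", backend="espeak", strip=True)
ko = phonemize("안녕하세요", language="ko", backend="espeak", strip=True)

# One shared symbol inventory for every language: a systematic G2P error in
# Korean would corrupt the very same token ids that English training relies on.
shared_vocab = {ph: i for i, ph in enumerate(sorted(set(en + ko)))}
print(en, [shared_vocab[ph] for ph in en])
print(ko, [shared_vocab[ph] for ph in ko])
```

A systematic G2P error in one language writes bad labels into token ids that every other language also trains on, which is why withdrawing Korean was the safer call for now.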
Cost Projection: The first released Kokoro model (v0.19) cost about $400 in A100 80GB GPU time. Including that previous amount, this next model will likely come out to around $1000 in total cost.
I have been unable to post an Article on HF due to being "Rate Limited". If and when this lifts, I would like to write at least one Article.
Jan 12: My intent is to supersede v0.19 with a better Kokoro model that dominates in every respect. To do this, I plan to continue training the unreleased v0.23 checkpoint on a richer data mix.
- If successful, you should expect the next-gen Kokoro model to ship with more voices and languages, also under an Apache 2.0 license, with a similar 82M parameter architecture.
- If unsuccessful, it would most likely be because the model does not converge, i.e. the loss does not go down. That could be due to data quality issues, architecture limitations, overfitting on old data, underfitting on new data, etc. Rollbacks and model collapse are not unheard of in ML, but fingers crossed they do not happen here, or if they do, that I can address such issues as they come up.
Behind the scenes, slabs of data have been (and still are) coming in thanks to the outstanding community response to #21 and I am incredibly grateful for it. Some of these slabs are languages new to the model, which is exciting. Note that #21 is first-come-first-serve, and at some point I will not be able to airdrop your data into a GPU in the middle of a training run.
Most of my focus is now on organizing these slabs such that they can be dispatched to GPUs later. Training has not started yet, since data is still flowing in and much processing work remains. In the meantime, I may not be able to get to some of your questions, but please understand that is not without reason.
That's it for now, thanks everyone!
Keep up the amazing work! Kokoro is a godsend to those who have been waiting for a license-permissive high quality TTS model for so long.
This is inspiring work! You are singlehandedly changing the game. God bless and always follow your vision 🙏
Great to hear. Question: is there any chance that Kokoro will eventually be able to handle things like breath sounds, coughs, and those.... interruptions... that normal speech has?
Not only is Apache 2 a good call, but I'm a huge fan of the 82M param size! Amazing work!!! ❤️❤️❤️
P.S.: could we have the updated Discord server link? The current one says it's no longer working.
Do you plan to include an emotion option, i.e. have the AI voice talk angrily or happily, etc.?
Kokoro is absolute trash. I've already switched to MeloTTS, which can do my beloved 44.1kHz! Try it out yourself! It can run on CPU!
@yukiarimo Gave it a fair shot. Exact copy and paste into https://huggingface.co/spaces/mrfakename/MeloTTS
Man, release the model quickly.
LMAO. Pre-trained models don't count. If you have a professional, real studio-quality dataset, it will sound indistinguishable from human (AND THE NORMAL SAMPLING RATE, LIKE REALLY?????? Yes, I know that most people are deaf, but it is really audible).
Plus, you said you generated the dataset with ElevenLabs and other TTSs? Well, it definitely sounds very robotic and non-human!
@FariqF Yeah, or at least the encoder part. Really, who cares about your weights (basic users do not count)? It is not an LLM where it is just too expensive to "drop the Wikipedia and go"; it is a TTS!
Kokoro is absolute trash. I've already switched to MeloTTS, which can do my beloved 44.1kHz! Try it out yourself! It can run on CPU!
It's so disrespectful to call someone's hard work “absolute trash”.
I like MeloTTS too, but I have to disagree with you.
That's what I call everything I don't like. For example, ElevenLabs: Good? Yes! Absolute trash? Yes! Why:
- Not open-source
- No real voice generation control
- Watermark. For everything
- Shitty ToS
- PVC verification
- No RVC-like conversion
- Can’t generate >5k at once
- Full WAV = X2 credits
- No training config / Limited data input
- Doesn’t sound natural enough
The decision to make a model open source ultimately rests with its owner or creator.
Yes, I know. I just added it to make the list 10 items long.
- It is not a big deal if you just share an architecture, but not the actual weights
@yukiarimo The architecture was already open sourced by Li et al:
- Paper: https://arxiv.org/abs/2306.07691
- Repository: https://github.com/yl4579/StyleTTS2
Although I did make slight modifications for training and inference, I do not claim that this is a novel architecture. You are more than welcome to train your own StyleTTS2 model using those resources above.
Beyond that, some unsolicited advice: it would serve you well in life to be a little more respectful to people who give you free things. If you were to go to your hypothetical mother-in-law's house, eat a cooked meal, then spit in it and scream "It's not good enough! Make me another!" I don't think that would play out well for you.
LMAO. Pre-trained models don't count. If you have a professional, real studio-quality dataset, it will sound indistinguishable from human (AND THE NORMAL SAMPLING RATE, LIKE REALLY?????? Yes, I know that most people are deaf, but it is really audible).
Plus, you said you generated the dataset with ElevenLabs and other TTSs? Well, it definitely sounds very robotic and non-human!
@FariqF Yeah, or at least the encoder part. Really, who cares about your weights (basic users do not count)? It is not an LLM where it is just too expensive to "drop the Wikipedia and go"; it is a TTS!
@yukiarimo If you can find an apache-2.0 licensed, "real studio-quality dataset" then lmk cause those kinds of datasets are not easy to find.
Kokoro is absolute trash. I've already switched to MeloTTS, which can do my beloved 44.1kHz! Try it out yourself! It can run on CPU!
@yukiarimo If you have switched to MeloTTS, why are you complaining about Kokoro? If MeloTTS works better for you then use it lol. Especially since it can run on CPU as well.
@yukiarimo The architecture was already open sourced by Li et al:
- Paper: https://arxiv.org/abs/2306.07691
- Repository: https://github.com/yl4579/StyleTTS2
Although I did make slight modifications for training and inference, I do not claim that this is a novel architecture. You are more than welcome to train your own StyleTTS2 model using those resources above.
Hello, hexgrad,
First, I want to congratulate you on the amazing work you’ve done and thank you for making the TTS model you’re sharing with us open source!
I'd like to kindly ask whether you could share the settings you use to train your models, and also the modified code, if possible?
I want to train a model that speaks Bulgarian because most existing models currently sound very robotic and unnatural. I have high-quality audio and will do my best to train the model as well as I can.
Thank you in advance!
@yukiarimo If you have switched to MeloTTS, why are you complaining about Kokoro? If MeloTTS works better for you then use it lol. Especially since it can run on CPU as well.
Because MeloTTS is 200M and this one is 84M. If it is possible to make "equal sounding" TTSs, I would like to go with the one that generates faster and uses less memory (assuming 1-to-1 identical generations).
@yukiarimo If you can find an apache-2.0 licensed, "real studio-quality dataset" then lmk cause those kinds of datasets are not easy to find.
I didn't! We have a voice actress at our headquarters who records material for datasets. All datasets are human-spoken (even for LLMs, I've never used shared LLM-generated stuff).
@yukiarimo The architecture was already open sourced by Li et al. Although I did make slight modifications for training and inference, I do not claim that this is a novel architecture. You are more than welcome to train your own StyleTTS2 model using those resources above.
Yes, but:
- What about the unreleased encoder part? Where did it come from? Also StyleTTS2?
- StyleTTS2 is slow and needs a lot of data, unlike Kokoro and Melo, which can be trained with a few hours of audio!
Then why don't you train Melo?
Already in the process. Just trying to experiment with different stuff to see what's better!
Love how you degrade someone's hard work to the level of "absolute trash", point out that you have already switched, and then come back to say that yes, even though it's absolute trash, you still experiment with it because it gives you faster results using less memory. All this gives you wonderful credibility and reputation, of course. Might I suggest you reevaluate what (or who) is absolute trash here?
Yes, of course! Do you want to know what is absolute trash here? That’s right -> it’s ElevenLabs who are putting their fucking watermark on every single clip!
Then go complain to eleven labs. At the end of the day hexgrad is the dev of the model. He can do what he wants with it. The code for style TTS 2 is available online and you can try to recreate what hex did.
@yukiarimo: I tried several TTS models locally a few months back and MeloTTS was a shitshow!
Documentation on the architecture is "we took every VITS" and empty corporate bullshit on myshell.ai.
Kokoro is trending hard on social networks and attracting attention. Jealousy is hideous.
"We have a voice actress in our headquarters." You have a headquarters and one voice actress, so you are a professional with a real studio-quality dataset who likes to crap on another man's work?
@yukiarimo: Put this into your melo "ɡəʊ fʌk jɔːˈsɛlf"
Just wanted to say I've been playing with your model today and it's really excellent. Being able to run it in browser is kind of amazing. Really impressive, thanks for your hard work on it.
thank you for this, you are doing god's work!
Kokoro is absolute trash. I've already switched to MeloTTS, which can do my beloved 44.1kHz! Try it out yourself! It can run on CPU!
@yukiarimo Gave it a fair shot. Exact copy and paste into https://huggingface.co/spaces/mrfakename/MeloTTS
LMAO, this app is still free.
I am not paying Hugging Face enough!!!!
Great to hear. Question: is there any chance that Kokoro will eventually be able to handle things like breath sounds, coughs, and those.... interruptions... that normal speech has?
It would be nice to have that. Let's hope we get it in the new model!
Just wanted to say @hexgrad, don't mind those negative comments.
You're doing great work!
Thank you for this!
Sometimes I wish I could block useless troll comments from an otherwise professional and non-toxic discussion. :(
Can we expect Turkish and Arabic among the new languages?
Thank you so much, and I sincerely look forward to other versions of your open-source work!!
I might contribute and help make the Arabic one.
Jan 21: A new Kokoro model should land in about a week, towards the end of January. Expect more voices and more languages, exact details TBD, although support for some languages may be thin due to low data volume or G2P issues. Initial results look promising, but we will see if loss continues to decrease.
Setback: Unfortunately, Korean had to be withdrawn from the training set due to G2P issues. Because Kokoro shares tokens (phonemes) across languages, G2P accuracy is paramount. There are ways around this—improving the G2P, or dedicating a few dozen tokens to just Korean—but none feasible in time for the next model.
Cost Projection: The first released Kokoro model (v0.19) cost about $400 in A100 80GB GPU time. Including that previous amount, this next model will likely come out to around $1000 in total cost.
I have been unable to post an Article on HF due to being "Rate Limited". If and when this lifts, I would like to write at least one Article.
--> I'm the one who really waited for Korean.
Is there any more detail on how to solve this problem? I would rather help than give up.
Or could you at least please share the v0.23 checkpoint? Most Koreans were very impressed by it.
What about effects like laugh, cough or lip smack? Are they included with the model?
Does it have support for words like uh, uhm or ummm or hmm?
It would be nice to have those 😅.
Will this release have the encoder part opensourced?
Yo, just an honest question on the cough: how do they do these?
Aggregate a bunch of recordings of call center agents with coughs and colds, who are forced to work because they ran out of sick leave, as training data?
Hmm, 🤔.
I am someone who has been truly waiting for Korean.
Could you please share the v0.23 checkpoint?
Most Koreans were very impressed by it.
v0.23 is very good!!
Since you're looking into adding new languages, feel free to use this Dutch dataset:
Would anyone like to collaborate with me on an Arabic version? We have budget to spend
Is there anywhere where I can make donations to you?
I was asked this question on Discord a few times as well, so I uploaded a DONATE.md file here: https://hf.co/hexgrad/Kokoro-82M/blob/main/DONATE.md
I currently do not own an NVIDIA GPU, so if I'm not paying for cloud GPUs, I'm relegated to T4 GPUs (15 GB vRAM). That is why "Usage" is typically a single cell that runs in Colab.
I can train Kokoro-type models on a T4, but it requires a much lower compute training configuration, takes way longer, and the loss generally does not converge as low as it would on an A100. In other words, the model won't sound as good.
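For illustration, one common way to run a lower-compute configuration on a small card is gradient accumulation. This is a generic PyTorch sketch under that assumption, not the actual Kokoro/StyleTTS2 training code:

```python
# Generic sketch of a "lower compute" training configuration (plain PyTorch,
# NOT the actual Kokoro/StyleTTS2 training loop): shrink the per-step batch
# to fit ~15 GB of vRAM and accumulate gradients, trading time for memory.
import torch
from torch import nn

model = nn.Linear(128, 1)                      # toy stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
loader = [(torch.randn(4, 128), torch.randn(4, 1)) for _ in range(32)]  # toy data

accum_steps = 8                                # assumed: 8 small steps ~ 1 big step
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the sum matches one big batch
    loss.backward()                            # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                       # one "effective" large-batch update
        optimizer.zero_grad()
```

The effective batch size stays comparable, but each update takes several passes, which is part of why T4 runs take so much longer and tend to stop at a higher loss.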
@hexgrad just a hint/question... if training costs you about $1k, maybe you should think about buying one of these "NVIDIA Digits" minicomputers when they are available this March?
From what I have read they should cost around $3k, but should be 3 times faster than the A100 and have more RAM available (at least that's what NVIDIA claims on their website).
Should be much cheaper in the long run than renting cloud GPUs, or am I missing something here?
P.S.: You are really doing great, innovative stuff here!! Very impressive!! I'm hoping to see language support for German here as well some day. ❤️
My donation would be based on an Arabic version being made.
How is it even possible to spend $1k on a cloud GPU? What configuration?
heheh, it's pretty easy, even with vast.ai
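For anyone curious how the numbers add up, here is a back-of-envelope sketch; the hourly rate below is an assumption, not a quoted marketplace price:

```python
# Back-of-envelope only; the hourly rate is an assumption, not a quoted price.
rate_per_hour = 1.60   # assumed $/hour for a rented A100 80GB (vast.ai-style ballpark)
v019_cost = 400        # ~$400 reported for v0.19
total_budget = 1000    # ~$1000 projected total

print(f"v0.19 ≈ {v019_cost / rate_per_hour:.0f} GPU-hours")
print(f"total ≈ {total_budget / rate_per_hour:.0f} GPU-hours "
      f"(~{total_budget / rate_per_hour / 24:.0f} GPU-days)")
```

Under that assumed rate, $1000 is a few hundred A100-hours, i.e. a few weeks of continuous single-GPU training, which is easy to burn through across data-processing runs and multiple training attempts.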