[Status] Jan 21 Update
Jan 21: A new Kokoro model should land in about a week, towards the end of January. Expect more voices and more languages, exact details TBD, although support for some languages may be thin due to low data volume or G2P issues. Initial results look promising, but we will see if loss continues to decrease.
Setback: Unfortunately, Korean had to be withdrawn from the training set due to G2P issues. Because Kokoro shares tokens (phonemes) across languages, G2P accuracy is paramount. There are ways around this—improving the G2P, or dedicating a few dozen tokens to just Korean—but none feasible in time for the next model.
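To illustrate why shared tokens make G2P accuracy paramount, here is a rough sketch (it assumes the `phonemizer` package with an espeak-ng backend as a stand-in G2P; the tiny shared vocabulary is made up for illustration, and this is not necessarily Kokoro's exact pipeline):

```python
# Rough illustration of shared phoneme tokens across languages.
# Assumes `pip install phonemizer` plus an espeak-ng install; this is a
# stand-in G2P, not necessarily the exact pipeline Kokoro uses.
from phonemizer import phonemize

en = phonemize("hello world", language="en-us", backend="espeak", strip=True)
ko = phonemize("안녕하세요", language="ko", backend="espeak", strip=True)

# One shared symbol inventory for every language: a systematic G2P error in
# Korean would corrupt the very same token ids that English training relies on.
shared_vocab = {ph: i for i, ph in enumerate(sorted(set(en + ko)))}
print(en, [shared_vocab[ph] for ph in en])
print(ko, [shared_vocab[ph] for ph in ko])
```

A systematic G2P error in one language writes bad labels into token ids that every other language also trains on, which is why withdrawing Korean was the safer call for now.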
Cost Projection: The first released Kokoro model (v0.19) cost about $400 in A100 80GB GPU time. Including that previous amount, this next model will likely come out to around $1000 in total cost.
I have been unable to post an Article on HF due to being "Rate Limited". If and when this lifts, I would like to write at least one Article.
Jan 12: My intent is to supersede v0.19 with a better Kokoro model that dominates in every respect. To do this, I plan to continue training the unreleased v0.23 checkpoint on a richer data mix.
- If successful, you should expect the next-gen Kokoro model to ship with more voices and languages, also under an Apache 2.0 license, with a similar 82M parameter architecture.
- If unsuccessful, it would most likely be because the model does not converge, i.e. the loss does not go down. That could be due to data quality issues, architecture limitations, overfitting on old data, underfitting on new data, etc. Rollbacks and model collapse are not unheard of in ML, but fingers crossed they do not happen here, or if they do, that I can address such issues as they come up.
Behind the scenes, slabs of data have been (and still are) coming in thanks to the outstanding community response to #21 and I am incredibly grateful for it. Some of these slabs are languages new to the model, which is exciting. Note that #21 is first-come-first-serve, and at some point I will not be able to airdrop your data into a GPU in the middle of a training run.
Most of my focus is now on organizing these slabs such that they can be dispatched to GPUs later. Training has not started yet, since data is still flowing in and much processing work remains. In the meantime, I may not be able to get to some of your questions, but please understand that is not without reason.
That's it for now, thanks everyone!
Keep up the amazing work! Kokoro is a godsend to those who have been waiting for a license-permissive high quality TTS model for so long.
This is inspiring work! You are singlehandedly changing the game. God bless and always follow your vision 🙏
Great to hear. Question: is there any chance that Kokoro will eventually be able to handle things like breath sounds, coughs, and those.... interruptions... that normal speech has?
Not only is Apache 2 a good call, but I'm a huge fan of the 82M param size! Amazing work!!! ❤️❤️❤️
P.S.: could we have the updated Discord server link? The current one says it's no longer working.
Do you plan to include an emotion option, i.e. have the AI voice talk angrily or happily, etc.?
Kokoro is absolute trash. I've already switched to MeloTTS, which can do my beloved 44.1kHz! Try it out yourself! It can run on CPU!
@yukiarimo Gave it a fair shot. Exact copy and paste into https://huggingface.co/spaces/mrfakename/MeloTTS
Man, release the model quickly.
LMAO. Pre-trained models don't count. If you have a professional, real studio-quality dataset, it will sound indistinguishable from human (AND THE NORMAL SAMPLING RATE, LIKE REALLY?????? Yes, I know that most people are deaf, but it is really audible).
Plus, you said you generated the dataset with ElevenLabs and other TTSs? Well, it definitely sounds very robotic and non-human!
@FariqF Yeah, or at least the encoder part. Really, who cares about your weights (basic users do not count)? It is not an LLM where it is just too expensive to "drop the Wikipedia and go"; it is a TTS!
Kokoro is absolute trash. I've already switched to MeloTTS, which can do my beloved 44.1kHz! Try it out yourself! It can run on CPU!
It's so disrespectful to call someone's hard work “absolute trash”.
I like MeloTTS too, but I have to disagree with you.
That's what I call everything I don't like. For example, ElevenLabs: Good? Yes! Absolute trash? Yes! Why:
- Not open-source
- No real voice generation control
- Watermark. For everything
- Shitty ToS
- PVC verification
- No RVC-like conversion
- Can’t generate >5k at once
- Full WAV = X2 credits
- No training config / Limited data input
- Doesn’t sound natural enough
The decision to make a model open source ultimately rests with its owner or creator.
Yes, I know. I just added it to make the list 10 items long.
- It is not a big deal if you just share an architecture, but not the actual weights
@yukiarimo The architecture was already open sourced by Li et al:
- Paper: https://arxiv.org/abs/2306.07691
- Repository: https://github.com/yl4579/StyleTTS2
Although I did make slight modifications for training and inference, I do not claim that this is a novel architecture. You are more than welcome to train your own StyleTTS2 model using those resources above.
Beyond that, some unsolicited advice: it would serve you well in life to be a little more respectful to people who give you free things. If you were to go to your hypothetical mother-in-law's house, eat a cooked meal, then spit in it and scream "It's not good enough! Make me another!" I don't think that would play out well for you.
LMAO. Pre-trained models don't count. If you have a professional, real studio-quality dataset, it will sound indistinguishable from human (AND THE NORMAL SAMPLING RATE, LIKE REALLY?????? Yes, I know that most people are deaf, but it is really audible).
Plus, you said you generated the dataset with ElevenLabs and other TTSs? Well, it definitely sounds very robotic and non-human!
@FariqF Yeah, or at least the encoder part. Really, who cares about your weights (basic users do not count)? It is not an LLM where it is just too expensive to "drop the Wikipedia and go"; it is a TTS!
@yukiarimo If you can find an apache-2.0 licensed, "real studio-quality dataset" then lmk cause those kinds of datasets are not easy to find.
Kokoro is absolute trash. I've already switched to MeloTTS, which can do my beloved 44.1kHz! Try it out yourself! It can run on CPU!
@yukiarimo If you have switched to MeloTTS, why are you complaining about Kokoro? If MeloTTS works better for you then use it lol. Especially since it can run on CPU as well.
@yukiarimo The architecture was already open sourced by Li et al:
- Paper: https://arxiv.org/abs/2306.07691
- Repository: https://github.com/yl4579/StyleTTS2
Although I did make slight modifications for training and inference, I do not claim that this is a novel architecture. You are more than welcome to train your own StyleTTS2 model using those resources above.
Hello, hexgrad,
First, I want to congratulate you on the amazing work you’ve done and thank you for making the TTS model you’re sharing with us open source!
I'd like to kindly ask whether you could share the settings you use to train your models, and also the modified code, if possible?
I want to train a model that speaks Bulgarian because most existing models currently sound very robotic and unnatural. I have high-quality audio and will do my best to train the model as well as I can.
Thank you in advance!
@yukiarimo If you have switched to MeloTTS, why are you complaining about Kokoro? If MeloTTS works better for you then use it lol. Especially since it can run on CPU as well.
Because MeloTTS is 200M and this one is 84M. If it is possible to make "equal sounding" TTSs, I would like to go with the one that generates faster and uses less memory (assuming 1-to-1 identical generations).
@yukiarimo If you can find an apache-2.0 licensed, "real studio-quality dataset" then lmk cause those kinds of datasets are not easy to find.
I didn't! We have a voice actress at our headquarters who records material for datasets. All datasets are human-spoken (even for LLMs, I've never used shared LLM-generated stuff).
@yukiarimo The architecture was already open sourced by Li et al. Although I did make slight modifications for training and inference, I do not claim that this is a novel architecture. You are more than welcome to train your own StyleTTS2 model using those resources above.
Yes, but:
- What about the unreleased encoder part? Where did it come from? Also StyleTTS2?
- StyleTTS2 is slow and needs a lot of data, unlike Kokoro and Melo, which can be trained with a few hours of audio!
Then why don't you train Melo?
Already in the process. Just trying to experiment with different stuff to see what's better!
Love how you degrade someone's hard work to the level of "absolute trash", point out that you have already switched, and then come back to say that yes, even though it's absolute trash, you still experiment with it because it gives you faster results using less memory. All this gives you wonderful credibility and reputation, of course. Might I suggest you reevaluate what (or who) is absolute trash here?
Yes, of course! Do you want to know what is absolute trash here? That’s right -> it’s ElevenLabs who are putting their fucking watermark on every single clip!
Then go complain to eleven labs. At the end of the day hexgrad is the dev of the model. He can do what he wants with it. The code for style TTS 2 is available online and you can try to recreate what hex did.
@yukiarimo: I tried several TTS models locally a few months back and MeloTTS was a shitshow!
Documentation on the architecture is "we took every VITS" and empty corporate bullshit on myshell.ai.
Kokoro is trending hard on social networks and attracting attention. Jealousy is hideous.
"We have a voice actress in our headquarters." You have a headquarters and one voice actress, so you are a professional with a real studio-quality dataset who likes to crap on another man's work?
@yukiarimo: Put this into your melo "ɡəʊ fʌk jɔːˈsɛlf"
Just wanted to say I've been playing with your model today and it's really excellent. Being able to run it in browser is kind of amazing. Really impressive, thanks for your hard work on it.
thank you for this, you are doing god's work!
Kokoro is absolute trash. I've already switched to MeloTTS, which can do my beloved 44.1kHz! Try it out yourself! It can run on CPU!
@yukiarimo Gave it a fair shot. Exact copy and paste into https://huggingface.co/spaces/mrfakename/MeloTTS
LMAO, this app is still free.
I am not paying Hugging Face enough!!!!
Great to hear. Question: is there any chance that Kokoro will eventually be able to handle things like breath sounds, coughs, and those.... interruptions... that normal speech has?
It would be nice to have that. Let's hope we get it in the new model!
Just wanted to say @hexgrad, don't mind those negative comments.
You're doing great work!
Thank you for this!
Sometimes I wish I could block useless troll comments from an otherwise professional and non-toxic discussion. :(
Can we expect Turkish and Arabic among the new languages?
Thank you so much, and I sincerely look forward to other versions of your open-source work!!
I might contribute and help make the Arabic one.
Jan 21: A new Kokoro model should land in about a week, towards the end of January. Expect more voices and more languages, exact details TBD, although support for some languages may be thin due to low data volume or G2P issues. Initial results look promising, but we will see if loss continues to decrease.
Setback: Unfortunately, Korean had to be withdrawn from the training set due to G2P issues. Because Kokoro shares tokens (phonemes) across languages, G2P accuracy is paramount. There are ways around this—improving the G2P, or dedicating a few dozen tokens to just Korean—but none feasible in time for the next model.
Cost Projection: The first released Kokoro model (v0.19) cost about $400 in A100 80GB GPU time. Including that previous amount, this next model will likely come out to around $1000 in total cost.
I have been unable to post an Article on HF due to being "Rate Limited". If and when this lifts, I would like to write at least one Article.
--> I'm the one who really waited for Korean.
Is there any more detail on how to solve this problem? I would rather help than give up.
Or could you at least please share the v0.23 checkpoint? Most Koreans were very impressed by it.
What about effects like laugh, cough or lip smack? Are they included with the model?
Does it have support for words like uh, uhm or ummm or hmm?
It would be nice to have those 😅.
Will this release have the encoder part opensourced?
Yo, just an honest question on the cough: how do they do these?
Aggregate a bunch of recordings of call center agents with coughs and colds, who are forced to work because they ran out of sick leave, as training data?
Hmm, 🤔.
I am someone who has been truly waiting for Korean.
Could you please share the v0.23 checkpoint?
Most Koreans were very impressed by it.
v0.23 is very good!!
Since you're looking into adding new languages, feel free to use this Dutch dataset:
Would anyone like to collaborate with me on an Arabic version? We have budget to spend
Is there anywhere where I can make donations to you?
I was asked this question on Discord a few times as well, so I uploaded a DONATE.md file here: https://hf.co/hexgrad/Kokoro-82M/blob/main/DONATE.md
I currently do not own an NVIDIA GPU, so if I'm not paying for cloud GPUs, I'm relegated to T4 GPUs (15 GB vRAM). That is why "Usage" is typically a single cell that runs in Colab.
I can train Kokoro-type models on a T4, but it requires a much lower compute training configuration, takes way longer, and the loss generally does not converge as low as it would on an A100. In other words, the model won't sound as good.
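For illustration, one common way to run a lower-compute configuration on a small card is gradient accumulation. This is a generic PyTorch sketch under that assumption, not the actual Kokoro/StyleTTS2 training code:

```python
# Generic sketch of a "lower compute" training configuration (plain PyTorch,
# NOT the actual Kokoro/StyleTTS2 training loop): shrink the per-step batch
# to fit ~15 GB of vRAM and accumulate gradients, trading time for memory.
import torch
from torch import nn

model = nn.Linear(128, 1)                      # toy stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
loader = [(torch.randn(4, 128), torch.randn(4, 1)) for _ in range(32)]  # toy data

accum_steps = 8                                # assumed: 8 small steps ~ 1 big step
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the sum matches one big batch
    loss.backward()                            # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                       # one "effective" large-batch update
        optimizer.zero_grad()
```

The effective batch size stays comparable, but each update takes several passes, which is part of why T4 runs take so much longer and tend to stop at a higher loss.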
@hexgrad just a hint/question... if training costs you about $1k, maybe you should think about buying one of these "NVIDIA Digits" minicomputers when they are available this March?
From what I have read they should cost around $3k, but should be 3 times faster than the A100 and have more RAM available (at least that's what NVIDIA claims on their website).
Should be much cheaper in the long run than renting cloud GPUs, or am I missing something here?
P.S.: You are really doing great, innovative stuff here!! Very impressive!! I'm hoping to see language support for German here as well some day. ❤️
My donation would be based on an Arabic version being made.
How is it even possible to spend $1k on a cloud GPU? What configuration?
heheh, it's pretty easy, even with vast.ai
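For anyone curious how the numbers add up, here is a back-of-envelope sketch; the hourly rate below is an assumption, not a quoted marketplace price:

```python
# Back-of-envelope only; the hourly rate is an assumption, not a quoted price.
rate_per_hour = 1.60   # assumed $/hour for a rented A100 80GB (vast.ai-style ballpark)
v019_cost = 400        # ~$400 reported for v0.19
total_budget = 1000    # ~$1000 projected total

print(f"v0.19 ≈ {v019_cost / rate_per_hour:.0f} GPU-hours")
print(f"total ≈ {total_budget / rate_per_hour:.0f} GPU-hours "
      f"(~{total_budget / rate_per_hour / 24:.0f} GPU-days)")
```

Under that assumed rate, $1000 is a few hundred A100-hours, i.e. a few weeks of continuous single-GPU training, which is easy to burn through across data-processing runs and multiple training attempts.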