Question

#2
by jpgallegoar - opened

Thanks for the open-source release, it looks great. I have a question: in the model card you say "For production or high-quality needs, I strongly recommend using llama.cpp for the best results." Can you please explain what you mean? What benefits does llama.cpp bring? Is it actually better quality, speed, or something else?

OuteAI org

Hi, it's a recommendation since llama.cpp supports the required sampling by default, and it seems to produce the best results with the model. I've described this in the notice "Important Sampling Considerations Across Different Backends."
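For illustration only, here is a sketch of the kind of sampler configuration that notice is about. The parameter names follow llama-cpp-python's `create_completion` API; the values below are placeholders, not the model card's recommendations, so consult the sampling notice for the actual settings.

```python
# Sampler settings of the kind the notice refers to. Parameter names
# follow llama-cpp-python's create_completion API; the VALUES below are
# placeholders only -- consult the model card for the recommended ones.
sampler_settings = {
    "temperature": 0.7,     # placeholder
    "top_k": 40,            # placeholder
    "top_p": 0.9,           # placeholder
    "repeat_penalty": 1.1,  # placeholder; backends differ in whether a
                            # repetition penalty is applied by default,
                            # which is why results vary between them
}

# Usage sketch (assumes llama-cpp-python and a local GGUF file):
# from llama_cpp import Llama
# llm = Llama(model_path="model.gguf")
# out = llm.create_completion("...", **sampler_settings)
```

The point of the notice is that some backends silently skip one of these samplers unless it is set explicitly, which degrades output quality.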


Thank you for your reply @edwko . I have another question: why did you change from WavTokenizer to DAC? As far as I can tell, WavTokenizer is more token-efficient: 40 TPS at 24 kHz instead of 75 TPS at 24 kHz.

OuteAI org

I made this decision because DAC offers better reconstruction quality (though it requires a higher TPS than WavTokenizer), which allows for improved voice cloning. It also handles multilingual audio without issues, whereas WavTokenizer is primarily trained on English data and struggles with multilingual reconstruction, so supporting 20 languages wouldn't have been feasible with it.


That's interesting, thanks for the explanation. I am particularly interested in multilingual reconstruction, so DAC is a good path forward. You used the 1.5 kbps version, right? I'm guessing the 3 kbps version wasn't that much better for double the tokens.

OuteAI org

Yeah, I used 1.5 kbps; it has 2 codebooks, so you can reconstruct audio at 150 TPS. The 3 kbps version would offer even better reconstruction, but it uses 4 codebooks, which means 300 TPS, a bit too many tokens for an autoregressive model to handle efficiently.
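The arithmetic behind those numbers can be sketched as follows. It assumes, based on the figures in this thread, that DAC at 24 kHz emits 75 frames per second and that each frame contributes one token per codebook:

```python
# Token-rate arithmetic for the codecs discussed in this thread.
# Assumption: DAC at 24 kHz produces 75 frames/s, one token per
# codebook per frame (consistent with the 75 TPS figure above).
DAC_FRAME_RATE = 75

def tokens_per_second(num_codebooks: int, frame_rate: int = DAC_FRAME_RATE) -> int:
    """Total autoregressive tokens the LM must generate per second of audio."""
    return num_codebooks * frame_rate

dac_1_5kbps = tokens_per_second(2)  # 2 codebooks -> 150 TPS
dac_3kbps = tokens_per_second(4)    # 4 codebooks -> 300 TPS
wavtokenizer = 40                   # single-codebook rate quoted in the thread
```

So doubling the bitrate doubles the sequence length the autoregressive model must produce per second of audio, which is the trade-off described above.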

