Question

#2
by jpgallegoar - opened

Thanks for the open-source release, it looks great. I have a question: in the model card you say "For production or high-quality needs, I strongly recommend using llama.cpp for the best results." Can you please explain what you mean? What benefits does llama.cpp bring? Is it actually better quality, speed, or something else?

OuteAI org

Hi, it's a recommendation since llama.cpp supports the required sampling by default, and it seems to produce the best results with the model. I've described this in the notice "Important Sampling Considerations Across Different Backends."
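For illustration only, here is a sketch of the kind of sampler configuration that notice is about. The parameter names follow llama-cpp-python's `create_completion` API; the values below are placeholders, not the model card's recommendations, so consult the sampling notice for the actual settings.

```python
# Sampler settings of the kind the notice refers to. Parameter names
# follow llama-cpp-python's create_completion API; the VALUES below are
# placeholders only -- consult the model card for the recommended ones.
sampler_settings = {
    "temperature": 0.7,     # placeholder
    "top_k": 40,            # placeholder
    "top_p": 0.9,           # placeholder
    "repeat_penalty": 1.1,  # placeholder; backends differ in whether a
                            # repetition penalty is applied by default,
                            # which is why results vary between them
}

# Usage sketch (assumes llama-cpp-python and a local GGUF file):
# from llama_cpp import Llama
# llm = Llama(model_path="model.gguf")
# out = llm.create_completion("...", **sampler_settings)
```

The point of the notice is that some backends silently skip one of these samplers unless it is set explicitly, which degrades output quality.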


Thank you for your reply @edwko . I have another question: why did you change from WavTokenizer to DAC? As far as I can tell, WavTokenizer is more token-efficient: 40 TPS at 24 kHz instead of 75 TPS at 24 kHz.

OuteAI org

I made this decision because DAC offers better reconstruction quality (though it requires a higher TPS than WavTokenizer), which allows for improved voice cloning. It also handles multilingual audio without issues, whereas WavTokenizer is primarily trained on English data and struggles with multilingual reconstruction, so supporting 20 languages wouldn't have been feasible with it.


That's interesting, thanks for the explanation. I am particularly interested in multilingual reconstruction, so DAC is a good path forward. You used the 1.5 kbps version, right? I'm guessing the 3 kbps version wasn't that much better for double the tokens.

OuteAI org

Yeah, I used 1.5 kbps; it has 2 codebooks, so you can reconstruct audio at 150 TPS. The 3 kbps version would offer even better reconstruction, but it uses 4 codebooks, which means 300 TPS, a bit too many tokens for an autoregressive model to handle efficiently.
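The arithmetic behind those numbers can be sketched as follows. It assumes, based on the figures in this thread, that DAC at 24 kHz emits 75 frames per second and that each frame contributes one token per codebook:

```python
# Token-rate arithmetic for the codecs discussed in this thread.
# Assumption: DAC at 24 kHz produces 75 frames/s, one token per
# codebook per frame (consistent with the 75 TPS figure above).
DAC_FRAME_RATE = 75

def tokens_per_second(num_codebooks: int, frame_rate: int = DAC_FRAME_RATE) -> int:
    """Total autoregressive tokens the LM must generate per second of audio."""
    return num_codebooks * frame_rate

dac_1_5kbps = tokens_per_second(2)  # 2 codebooks -> 150 TPS
dac_3kbps = tokens_per_second(4)    # 4 codebooks -> 300 TPS
wavtokenizer = 40                   # single-codebook rate quoted in the thread
```

So doubling the bitrate doubles the sequence length the autoregressive model must produce per second of audio, which is the trade-off described above.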

