Support evaluations of other languages?

#10
by laubonghaudoi - opened

Currently all test cases are in English, so models are evaluated only on their English performance. This misses the multilingual abilities of many TTS models; XTTS, for example, supports 16 languages. Multilingual ability is an important dimension of TTS models, and we can't assume that a model's English performance transfers to other languages. It would be very valuable if we could evaluate performance in non-English languages.

I plan to add the capability. Not sure what random sentences to use.

Could go with the ones on Common Voice. All those have been evaluated by the public.
https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/tree/main/transcript

Thank you! Common Voice is definitely a very good choice. All text sentences are in the public domain and are validated by volunteers. We can sample sentences from the validated portion as test cases. We just need to be aware that some sentences might not be normalized (they contain special symbols or unreadable words) and some are too short or too long (a single word, or hundreds of words). As long as we filter out those dirty samples, the remaining ones should make very good TTS test cases.
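For what it's worth, the filtering could look roughly like the sketch below. The thresholds, the allowed punctuation set, and the function names are all my own assumptions, not anything from Common Voice or the Arena code. It also counts words by whitespace, so it would need a character-based length check for languages such as Japanese or Chinese.

```python
import random
import unicodedata

# Hypothetical thresholds and punctuation set; tune these per language.
MIN_WORDS = 3
MAX_WORDS = 30
ALLOWED_PUNCT = set(".,!?;:'\"-")


def is_clean(sentence: str, min_words: int = MIN_WORDS, max_words: int = MAX_WORDS) -> bool:
    """Return True if the sentence looks usable as a TTS test case."""
    text = sentence.strip()
    # Drop sentences that are too short or too long (whitespace word count).
    n_words = len(text.split())
    if not (min_words <= n_words <= max_words):
        return False
    # Keep only letters, combining marks, plain spaces and basic punctuation.
    # Digits and symbols are rejected because they are not normalized text.
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in ("L", "M") or category == "Zs":
            continue
        if ch in ALLOWED_PUNCT:
            continue
        return False
    return True


def sample_test_sentences(sentences, k, seed=0):
    """Filter the sentences, then draw up to k of them reproducibly."""
    clean = [s.strip() for s in sentences if is_clean(s)]
    rng = random.Random(seed)
    return rng.sample(clean, min(k, len(clean)))
```

Here `sentences` would be the sentence column read from the validated transcripts. Fixing the seed keeps the sampled test set stable between runs.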

Well, at first I wanted to hardcode in another language. The Japanese TTS Arena by @kotoba-tech, which is also a clone of TTS Arena, has been down for quite a few weeks:
https://huggingface.co/spaces/kotoba-tech/TTS-Arena-JA

Pinging @lihaoxin2020 @arumaekawa @kojimano3 @jungok to prod their interest in having a multilingual TTS Arena. This also means the top TTS models get even more scrutiny and are challenged further. The Leaderboard would then have language filters. I wanted text style filters too, but those will have to come later.

Text is one thing; the other is getting native voices, both fine-tuned models and voice samples for zero-shot TTS. That is quite a bit of work to do alone... 😡

I don't think we need to get copies of fine-tuned models? I thought the arena is meant to benchmark the performance of the base model, not various downstream adapted models. If a model doesn't support a language, we can just say it fails in this dimension, which is still a useful piece of information and an indication of the model capability.

Code synched in #11, with an easier structure for other devs to comprehend.

But I feel #7 is required before proceeding with a multilingual Arena. I expect the popularity of this Space to really skyrocket, which means my account's daily HF token allowance will be used up quickly generating new audio. Getting cached samples from a synthetic dataset is a priority.
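To illustrate the idea of serving cached samples instead of spending the token allowance on every request, here is a minimal sketch. It uses a local directory as the cache, which is a stand-in I chose for illustration and not the dataset-backed approach in #7. `get_audio`, `cache_key`, and the `synthesize` callback are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(model: str, text: str) -> str:
    """Stable key for a (model, sentence) pair."""
    return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()


def get_audio(
    model: str,
    text: str,
    synthesize: Callable[[str, str], bytes],
    cache_dir: Path,
) -> bytes:
    """Return cached audio if present; otherwise synthesize once and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(model, text)}.bin"
    if path.exists():
        # Cache hit: no call to the TTS backend, so no allowance is used.
        return path.read_bytes()
    audio = synthesize(model, text)  # hypothetical call into the TTS backend
    path.write_bytes(audio)
    return audio
```

With a fixed pool of test sentences per language, each (model, sentence) pair would only need to be synthesized once.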
