# Kyutai TTS voices

Do you want more voices?

Help us by [donating your voice](https://unmute.sh/voice-donation),
or open an issue in the [TTS repo](https://github.com/kyutai-labs/delayed-streams-modeling/) to suggest permissively licensed voice datasets we could add here.

## vctk/

From the [Voice Cloning Toolkit](https://datashare.ed.ac.uk/handle/10283/3443) dataset,
licensed under the Creative Commons Attribution 4.0 International license.

Each recording was made with two microphones; we use the `mic1` recordings here.
We chose sentence 23 for every speaker because it is generally the longest one to pronounce.
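
As a concrete illustration of this choice, the helper below builds the path of the selected clip for a given speaker. The directory layout and file naming assumed here (`wav48_silence_trimmed/{speaker}/{speaker}_{sentence:03d}_mic1.flac`) correspond to VCTK release 0.92, but verify against your copy of the dataset.

```python
# Hypothetical helper: path of the clip selected for one VCTK speaker.
# The layout below is an assumption based on VCTK release 0.92.

def vctk_clip_path(speaker, sentence=23, mic="mic1"):
    """Path of the given sentence recorded by the given mic for a speaker."""
    return f"wav48_silence_trimmed/{speaker}/{speaker}_{sentence:03d}_{mic}.flac"

print(vctk_clip_path("p225"))  # wav48_silence_trimmed/p225/p225_023_mic1.flac
```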

## expresso/

From the [Expresso](https://speechbot.github.io/expresso/) dataset,
licensed under the Creative Commons Attribution-NonCommercial 4.0 International license.
**Non-commercial use only.**

We select clips from the "conversational" files.
For each pair of "kind" and channel (e.g. `ex04-ex01_laughing`, channel 1),
we find one segment with at least 10 consecutive seconds of speech using `VAD_segments.txt`.
We don't include more segments per (kind, channel) pair to keep the number of voices manageable.
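
The selection step above can be sketched as follows. The exact layout of `VAD_segments.txt` is an assumption here (one `start end` pair of seconds per line), so treat this as illustrative rather than the actual pipeline code.

```python
# Sketch of the segment selection described above. Assumed input format:
# each line of VAD_segments.txt holds "<start_seconds> <end_seconds>"
# for one contiguous span of detected speech.

def first_long_segment(vad_lines, min_seconds=10.0):
    """Return the (start, end) of the first speech span lasting at
    least `min_seconds`, or None if there is none."""
    for line in vad_lines:
        start, end = map(float, line.split())
        if end - start >= min_seconds:
            return start, end
    return None

# Example: the second span is the first one long enough.
spans = ["12.5 18.0", "674.0 691.3", "700.0 705.0"]
print(first_long_segment(spans))  # (674.0, 691.3)
```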

The name of each file indicates how it was selected.
For instance, `ex03-ex02_narration_001_channel1_674s.wav`
comes from the first audio channel of `audio_48khz/conversational/ex03-ex02/narration/ex03-ex02_narration_001.wav`,
meaning the speaker is `ex03`.
It is a 10-second clip starting at 674 seconds into the original file.
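
The naming convention can be unpacked mechanically. The parser below is a hypothetical illustration (not part of the repo) of the fields encoded in each file name, assuming channel 1 corresponds to the first speaker of the pair and channel 2 to the second:

```python
import re

# Illustrative parser for the naming convention described above:
# {spk1}-{spk2}_{kind}_{index}_channel{c}_{start}s.wav

NAME_RE = re.compile(
    r"(?P<spk1>ex\d+)-(?P<spk2>ex\d+)_(?P<kind>\w+?)_(?P<index>\d+)"
    r"_channel(?P<channel>\d)_(?P<start>\d+)s\.wav"
)

def parse_voice_name(name):
    """Split a voice file name into speaker, kind, channel, and start offset."""
    m = NAME_RE.fullmatch(name)
    channel = int(m["channel"])
    return {
        # Channel 1 is the first speaker of the pair, channel 2 the second.
        "speaker": m["spk1"] if channel == 1 else m["spk2"],
        "kind": m["kind"],
        "channel": channel,
        "start_seconds": int(m["start"]),
    }

print(parse_voice_name("ex03-ex02_narration_001_channel1_674s.wav"))
# {'speaker': 'ex03', 'kind': 'narration', 'channel': 1, 'start_seconds': 674}
```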

## cml-tts/fr/

French voices selected from the [CML-TTS Dataset](https://openslr.org/146/),
licensed under the Creative Commons Attribution 4.0 International license.

## ears/

From the [EARS](https://sp-uhh.github.io/ears_dataset/) dataset,
licensed under the Creative Commons Attribution-NonCommercial 4.0 International license.
**Non-commercial use only.**

For each of the 107 speakers, we use the middle 10 seconds of the `freeform_speech_01.wav` file.
Additionally, we select two speakers, p003 (female) and p031 (male), and provide speaker embeddings for each of their `emo_*_freeform.wav` files.
This lets users experiment with a single speaker's voice across multiple emotions.
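
A minimal sketch of the "middle 10 seconds" selection: the use of the stdlib `wave` module for I/O is an illustrative assumption, and the window arithmetic (centering a 10-second span inside the recording) is the point.

```python
import wave

def middle_window(n_frames, sample_rate, seconds=10.0):
    """Frame range [start, end) of a `seconds`-long window centered
    in a recording of `n_frames` frames."""
    want = int(seconds * sample_rate)
    start = max(0, (n_frames - want) // 2)
    return start, min(n_frames, start + want)

def extract_middle(path_in, path_out, seconds=10.0):
    """Copy the centered window of a WAV file to a new WAV file.
    Illustrative only; assumes a plain PCM WAV readable by `wave`."""
    with wave.open(path_in, "rb") as src:
        start, end = middle_window(src.getnframes(), src.getframerate(), seconds)
        src.setpos(start)
        frames = src.readframes(end - start)
        with wave.open(path_out, "wb") as dst:
            dst.setparams(src.getparams())
            dst.writeframes(frames)

# A 60 s recording at 48 kHz yields the window [1_200_000, 1_680_000).
print(middle_window(60 * 48000, 48000))  # (1200000, 1680000)
```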

## Computing voice embeddings (for Kyutai devs)

```bash
uv run {root of `moshi` repo}/scripts/tts_make_voice.py \
  --model-root {path to weights dir}/moshi_1e68beda_240/ \
  --loudness-headroom 22 \
  {root of this repo}
```