# Kyutai TTS voices

Do you want more voices? Help us by [donating your voice](https://unmute.sh/voice-donation), or open an issue in the [TTS repo](https://github.com/kyutai-labs/delayed-streams-modeling/) to suggest permissively-licensed datasets of voices we could add here.

## vctk/

From the [Voice Cloning Toolkit](https://datashare.ed.ac.uk/handle/10283/3443) dataset, licensed under the Creative Commons License: Attribution 4.0 International.

Each recording was done with two mics; here we used the `mic1` recordings. We chose sentence 23 for every speaker because it's generally the longest one to pronounce.

## expresso/

From the [Expresso](https://speechbot.github.io/expresso/) dataset, licensed under the Creative Commons License: Attribution-NonCommercial 4.0 International. **Non-commercial use only.**

We select clips from the "conversational" files. For each pair of "kind" and channel (e.g. `ex04-ex01_laughing`, channel 1), we find one segment with at least 10 consecutive seconds of speech using `VAD_segments.txt`. We don't include more segments per (kind, channel) pair to keep the number of voices manageable.

The name of the file indicates how it was selected. For instance, `ex03-ex02_narration_001_channel1_674s.wav` comes from the first audio channel of `audio_48khz/conversational/ex03-ex02/narration/ex03-ex02_narration_001.wav`, meaning it's speaker `ex03`. It's a 10-second clip starting 674 seconds into the original file. (A filename-parsing sketch appears near the end of this file.)

## cml-tts/fr/

French voices selected from the [CML-TTS Dataset](https://openslr.org/146/), licensed under the Creative Commons License: Attribution 4.0 International.

## ears/

From the [EARS](https://sp-uhh.github.io/ears_dataset/) dataset, licensed under the Creative Commons License: Attribution-NonCommercial 4.0 International. **Non-commercial use only.**

For each of the 107 speakers, we use the middle 10 seconds of the `freeform_speech_01.wav` file (see the clip-extraction sketch near the end of this file).

Additionally, we select two speakers, p003 (female) and p031 (male), and provide speaker embeddings for each of their `emo_*_freeform.wav` files. This allows users to experiment with a single speaker's voice across multiple emotions.

## Computing voice embeddings (for Kyutai devs)

```bash
uv run {root of `moshi` repo}/scripts/tts_make_voice.py \
    --model-root {path to weights dir}/moshi_1e68beda_240/ \
    --loudness-headroom 22 \
    {root of this repo}
```
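To sanity-check the output, here is a minimal sketch that lists the generated embedding files. It assumes the script writes one `.safetensors` file per voice alongside the `.wav` files; the exact output layout is determined by the script, so adjust the glob if it differs.

```python
# Minimal sketch: list every voice-embedding file in the repo and print
# its tensor names and shapes. Assumes tts_make_voice.py writes a
# .safetensors file per voice (an assumption, not documented above).
from pathlib import Path

from safetensors import safe_open

for emb_path in sorted(Path(".").rglob("*.safetensors")):
    with safe_open(emb_path, framework="numpy") as f:
        shapes = {key: f.get_slice(key).get_shape() for key in f.keys()}
    print(emb_path, shapes)
```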
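## Appendix: helper sketches

The `expresso/` naming convention described above can be unpacked mechanically. Below is a minimal sketch; the channel-to-speaker mapping (channel *n* carries the *n*-th speaker of the pair) is an assumption generalized from the `ex03` example, and `parse_expresso_name` is a hypothetical helper, not part of this repo.

```python
# Minimal sketch: recover the source path, speaker, and clip offset from
# an expresso/ voice filename such as "ex03-ex02_narration_001_channel1_674s.wav".
import re

PATTERN = re.compile(
    r"(?P<pair>ex\d+-ex\d+)_(?P<kind>[a-z]+)_(?P<index>\d+)"
    r"_channel(?P<channel>\d+)_(?P<start>\d+)s\.wav"
)

def parse_expresso_name(name: str) -> dict:
    m = PATTERN.fullmatch(name)
    if m is None:
        raise ValueError(f"not an expresso voice filename: {name}")
    pair, kind, index = m["pair"], m["kind"], m["index"]
    channel, start = int(m["channel"]), int(m["start"])
    speakers = pair.split("-")
    return {
        "source": f"audio_48khz/conversational/{pair}/{kind}/{pair}_{kind}_{index}.wav",
        # Assumption: channel n of the recording carries the n-th speaker of the pair.
        "speaker": speakers[channel - 1],
        "channel": channel,
        "start_seconds": start,  # the clip is the 10 seconds starting here
    }

print(parse_expresso_name("ex03-ex02_narration_001_channel1_674s.wav"))
```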
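The `ears/` clips are simply the center of each recording. Here is a minimal sketch of that selection, assuming the `soundfile` package; the actual extraction tooling is not part of this repo.

```python
# Minimal sketch: cut the middle 10 seconds out of freeform_speech_01.wav,
# as described for the ears/ voices. Assumes the soundfile package.
import soundfile as sf

data, sample_rate = sf.read("freeform_speech_01.wav")
clip_samples = 10 * sample_rate
start = max(0, (len(data) - clip_samples) // 2)  # center the 10 s window
sf.write("middle_10s.wav", data[start:start + clip_samples], sample_rate)
```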