# Kyutai TTS voices
Do you want more voices?
Help us by [donating your voice](https://unmute.sh/voice-donation),
or open an issue in the [TTS repo](https://github.com/kyutai-labs/delayed-streams-modeling/) to suggest permissively licensed voice datasets we could add here.
## vctk/
From the [Voice Cloning Toolkit](https://datashare.ed.ac.uk/handle/10283/3443) dataset,
licensed under the Creative Commons License: Attribution 4.0 International.
Each recording was made with two mics; here we use the `mic1` recordings.
We chose sentence 23 for every speaker because it's generally the longest one to pronounce.
## expresso/
From the [Expresso](https://speechbot.github.io/expresso/) dataset,
licensed under the Creative Commons License: Attribution-NonCommercial 4.0 International.
**Non-commercial use only.**
We select clips from the "conversational" files.
For each pair of "kind" and channel (`ex04-ex01_laughing`, channel 1),
we find one segment with at least 10 consecutive seconds of speech using `VAD_segments.txt`.
We don't include more segments per (kind, channel) to keep the number of voices manageable.
The name of the file indicates how it was selected.
For instance, `ex03-ex02_narration_001_channel1_674s.wav`
comes from the first audio channel of `audio_48khz/conversational/ex03-ex02/narration/ex03-ex02_narration_001.wav`,
meaning it's speaker `ex03`.
It's a 10-second clip starting 674 seconds into the original file.
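For reference, here is a minimal sketch (not the script we used to build this repo) of how such a filename maps back to the original clip. It assumes the `soundfile` package and a local copy of the Expresso dataset at `expresso_root`:

```python
# Minimal sketch, not the script used to build this repo: recover the
# 10-second clip that a voice filename like
# "ex03-ex02_narration_001_channel1_674s.wav" refers to.
import re
import soundfile as sf

def extract_expresso_clip(voice_name: str, expresso_root: str, duration_s: float = 10.0):
    m = re.match(
        r"(?P<pair>ex\d+-ex\d+)_(?P<kind>.+)_(?P<idx>\d+)"
        r"_channel(?P<ch>\d+)_(?P<start>\d+)s\.wav",
        voice_name,
    )
    if m is None:
        raise ValueError(f"unexpected filename: {voice_name}")
    # Rebuild the path of the original recording, e.g.
    # audio_48khz/conversational/ex03-ex02/narration/ex03-ex02_narration_001.wav
    src = (f"{expresso_root}/audio_48khz/conversational/"
           f"{m['pair']}/{m['kind']}/{m['pair']}_{m['kind']}_{m['idx']}.wav")
    sr = sf.info(src).samplerate
    start = int(m["start"]) * sr
    audio, _ = sf.read(src, start=start, stop=start + int(duration_s * sr))
    # "channel1" is the first audio channel, i.e. index 0.
    return audio[:, int(m["ch"]) - 1], sr
```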
## cml-tts/fr/
French voices selected from the [CML-TTS Dataset](https://openslr.org/146/),
licensed under the Creative Commons License: Attribution 4.0 International.
## ears/
From the [EARS](https://sp-uhh.github.io/ears_dataset/) dataset,
licensed under the Creative Commons License: Attribution-NonCommercial 4.0 International.
**Non-commercial use only.**
For each of the 107 speakers, we use the middle 10 seconds of the `freeform_speech_01.wav` file.
Additionally, we select two speakers, p003 (female) and p031 (male), and provide speaker embeddings for each of their `emo_*_freeform.wav` files.
This lets users experiment with a single speaker's voice across multiple emotions.
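A minimal sketch of the "middle 10 seconds" selection described above (not the script we used), assuming the `soundfile` package; the example path layout is hypothetical:

```python
# Minimal sketch: take the middle 10 seconds of a speaker's freeform recording.
import soundfile as sf

def middle_clip(path: str, duration_s: float = 10.0):
    info = sf.info(path)
    clip_frames = int(duration_s * info.samplerate)
    start = max(0, (info.frames - clip_frames) // 2)
    audio, sr = sf.read(path, start=start, stop=start + clip_frames)
    return audio, sr

# Hypothetical path layout for a local copy of EARS:
# audio, sr = middle_clip("EARS/p003/freeform_speech_01.wav")
```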
## Computing voice embeddings (for Kyutai devs)
```bash
uv run {root of `moshi` repo}/scripts/tts_make_voice.py \
--model-root {path to weights dir}/moshi_1e68beda_240/ \
--loudness-headroom 22 \
{root of this repo}
```