Kartoffel-TTS (Based on Chatterbox) - German Text-to-Speech
The model is still in development and was trained on only 600k samples, without emotion classification, on my two RTX 3090s. I am currently preparing a larger dataset (>2.5M samples) and classifying the exaggeration.
Updates
- The model has been rebuilt using Chatterbox, Resemble AI's open-source TTS framework. This allows for emotion exaggeration control and improved stability.
Model Overview
Kartoffel-TTS is a German text-to-speech (TTS) model family based on Chatterbox, designed for natural and expressive speech synthesis. The model supports emotion exaggeration control and voice cloning.
Key Features:
- Emotion Exaggeration Control: Adjust the intensity of emotions in speech, from subtle to dramatic.
- Expressive Speech: Capable of producing speech with different emotional tones and expressions.
- Fine-Tuned for German: Optimized for German language synthesis with a focus on naturalness and clarity.
Installation
Install the required libraries:
pip install chatterbox-tts
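The usage example below also imports soundfile, huggingface_hub, and safetensors. These are normally pulled in as dependencies of chatterbox-tts, but if any of them are missing they can be installed the same way:
pip install soundfile huggingface_hub safetensors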
Usage Example
Here’s how to generate speech using Kartoffel-TTS:
import torch
import soundfile as sf
from chatterbox.tts import ChatterboxTTS
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

MODEL_REPO = "SebastianBodza/Kartoffelbox-v0.1"
T3_CHECKPOINT_FILE = "t3_kartoffelbox.safetensors"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base Chatterbox model, then overwrite its T3 weights with the German fine-tune.
model = ChatterboxTTS.from_pretrained(device=device)

print("Downloading and applying German patch...")
checkpoint_path = hf_hub_download(repo_id=MODEL_REPO, filename=T3_CHECKPOINT_FILE)
t3_state = load_file(checkpoint_path, device="cpu")
model.t3.load_state_dict(t3_state)
print("Patch applied successfully.")

text = "Tief im verwunschenen Wald, wo die Bäume uralte Geheimnisse flüsterten, lebte ein kleiner Gnom namens Fips, der die Sprache der Tiere verstand."
reference_audio_path = "/content/uitoll.mp3"
output_path = "output_cloned_voice.wav"

print("Generating speech...")
with torch.inference_mode():
    wav = model.generate(
        text,
        audio_prompt_path=reference_audio_path,  # reference voice to clone
        exaggeration=0.5,                        # emotion intensity, from subtle to dramatic
        temperature=0.6,
        cfg_weight=0.3,
    )

sf.write(output_path, wav.squeeze().cpu().numpy(), model.sr)
print(f"Audio saved to {output_path}")
Contributing
To improve the model further, additional high-quality German audio data with good transcripts is needed, especially for sounds like laughter, sighs, and other non-verbal expressions. Short audio clips (up to 60 seconds) with accurate transcriptions are particularly valuable.
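As an illustration only, here is a minimal sketch of how contributed clips could be sanity-checked against the points above; the contributions folder and the convention of a .txt transcript next to each audio file are assumptions for the example, not a required format.
import os
import soundfile as sf

DATA_DIR = "contributions"  # hypothetical folder of audio clips with .txt transcripts

for name in sorted(os.listdir(DATA_DIR)):
    if not name.lower().endswith((".wav", ".flac")):
        continue
    audio_path = os.path.join(DATA_DIR, name)
    transcript_path = os.path.splitext(audio_path)[0] + ".txt"

    # Clips should stay at or below 60 seconds.
    duration = sf.info(audio_path).duration
    if duration > 60:
        print(f"{name}: too long ({duration:.1f}s)")

    # Every clip needs an accurate transcript.
    if not os.path.isfile(transcript_path):
        print(f"{name}: transcript missing")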
For those with ideas or access to relevant data, collaboration opportunities are always welcome. Reach out to discuss potential contributions.
Acknowledgements
This model builds on the following technologies:
- Chatterbox by Resemble AI
- CosyVoice
- HiFT-GAN
- Llama
- S3Tokenizer