Kartoffel-TTS (Based on Chatterbox) - German Text-to-Speech
The model is still in development and was trained on only 600k samples, without emotion classification, on my two RTX 3090s. I am currently preparing a larger dataset (>2.5M samples) and classifying the exaggeration.
Updates
- The model has been rebuilt using Chatterbox, Resemble AI's open-source TTS framework. This allows for emotion exaggeration control and improved stability.
Model Overview
Kartoffel-TTS is a German text-to-speech (TTS) model family based on Chatterbox, designed for natural and expressive speech synthesis. The model supports emotion exaggeration control and voice cloning.
Key Features:
- Emotion Exaggeration Control: Adjust the intensity of emotions in speech, from subtle to dramatic.
- Expressive Speech: Capable of producing speech with different emotional tones and expressions.
- Fine-Tuned for German: Optimized for German language synthesis with a focus on naturalness and clarity.
Installation
Install the required libraries:
pip install chatterbox-tts
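The usage example below also imports soundfile, huggingface_hub, and safetensors. These are normally pulled in as dependencies of chatterbox-tts, but if any of them are missing they can be installed the same way:
pip install soundfile huggingface_hub safetensors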
Usage Example
Here’s how to generate speech using Kartoffel-TTS:
import torch
import soundfile as sf
from chatterbox.tts import ChatterboxTTS
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

MODEL_REPO = "SebastianBodza/Kartoffelbox-v0.1"
T3_CHECKPOINT_FILE = "t3_kartoffelbox.safetensors"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base Chatterbox model, then overwrite its T3 weights with the German fine-tune.
model = ChatterboxTTS.from_pretrained(device=device)

print("Downloading and applying German patch...")
checkpoint_path = hf_hub_download(repo_id=MODEL_REPO, filename=T3_CHECKPOINT_FILE)
t3_state = load_file(checkpoint_path, device="cpu")
model.t3.load_state_dict(t3_state)
print("Patch applied successfully.")

text = "Tief im verwunschenen Wald, wo die Bäume uralte Geheimnisse flüsterten, lebte ein kleiner Gnom namens Fips, der die Sprache der Tiere verstand."
reference_audio_path = "/content/uitoll.mp3"
output_path = "output_cloned_voice.wav"

print("Generating speech...")
with torch.inference_mode():
    wav = model.generate(
        text,
        audio_prompt_path=reference_audio_path,  # reference voice to clone
        exaggeration=0.5,                        # emotion intensity, from subtle to dramatic
        temperature=0.6,
        cfg_weight=0.3,
    )

sf.write(output_path, wav.squeeze().cpu().numpy(), model.sr)
print(f"Audio saved to {output_path}")
Contributing
To improve the model further, additional high-quality German audio data with good transcripts is needed, especially for sounds like laughter, sighs, and other non-verbal expressions. Short audio clips (up to 60 seconds) with accurate transcriptions are particularly valuable.
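As an illustration only, here is a minimal sketch of how contributed clips could be sanity-checked against the points above; the contributions folder and the convention of a .txt transcript next to each audio file are assumptions for the example, not a required format.
import os
import soundfile as sf

DATA_DIR = "contributions"  # hypothetical folder of audio clips with .txt transcripts

for name in sorted(os.listdir(DATA_DIR)):
    if not name.lower().endswith((".wav", ".flac")):
        continue
    audio_path = os.path.join(DATA_DIR, name)
    transcript_path = os.path.splitext(audio_path)[0] + ".txt"

    # Clips should stay at or below 60 seconds.
    duration = sf.info(audio_path).duration
    if duration > 60:
        print(f"{name}: too long ({duration:.1f}s)")

    # Every clip needs an accurate transcript.
    if not os.path.isfile(transcript_path):
        print(f"{name}: transcript missing")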
For those with ideas or access to relevant data, collaboration opportunities are always welcome. Reach out to discuss potential contributions.
Acknowledgements
This model builds on the following technologies:
- Chatterbox by Resemble AI
- CosyVoice
- HiFT-GAN
- Llama
- S3Tokenizer