Kartoffel-3B (Based on Orpheus-3B) - Synthetic

Updates

I have added a blog post that includes code on how to use Kartoffel for near real-time streaming. Check it out here: Blog Post.

Model Overview

This is a German text-to-speech (TTS) model family based on Orpheus-3B.

Two main versions are available:

Kartoffel-3B-Natural: Fine-tuned primarily on natural human speech recordings, aiming for realistic voices. The dataset is based on, high-quality German audio, including permissive podcasts, lectures, and other OER data that were processed with an Emilia styled pipeline.
Kartoffel-3B-Synthetic: Fine-tuned using synthetic speech data, with emotions and different outbursts. The dataset consists of a diverse set of emotions with 4 different speeakers.

This is currently the synthetic version for synthetic sounding speakers, but with added emotion and outburst support.

Both versions support:

Multiple Speakers: The model can generate speech using various speaker identities from predefined speakers.
Varied Expressions: Capable of generating speech with different emotional tones and expressions based on the input text.

Available Speakers & Expressions for the synthetic Version:

Speakers:

Martin
Luca
Anne
Emma

Emotions:

To add emotions the following ones are used:

Neutral
Happy
Sad
Excited
Surprised
Humorous
Angry
Calm
Disgust
Fear
Proud
Romantic

To use them add them behind the speaker name like [Speaker_name] - [Emotion]: [German text] for example for the speaker Martin and the Emotion sad, the correct template would be:

Martin - Sad: Oh ich bin sooo traurig.

Outbursts:

The following outbursts are working:

haha
ughh
wow
wuhuuu
ohhh

You can either directly use them in the text or place them in tags. Keep in mind to use the exact text from the variations.

Inference

import torch
import torchaudio.transforms as T
import os
import torch
from snac import SNAC

from peft import PeftModel
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoTokenizer


model = AutoModelForCausalLM.from_pretrained(
    "SebastianBodza/Kartoffel_Orpheus-3B_german_synthetic-v0.1",
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    "SebastianBodza/Kartoffel_Orpheus-3B_german_synthetic-v0.1",
)

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")

chosen_voice = "Martin"

prompts = [
    'Tief im verwunschenen Wald, wo die Bäume uralte Geheimnisse flüsterten, lebte ein kleiner Gnom namens Fips, der die Sprache der Tiere verstand.',
]

def process_single_prompt(prompt, chosen_voice):
    if chosen_voice == "in_prompt" or chosen_voice == "":
        full_prompt = prompt
    else:
        full_prompt = f"{chosen_voice}: {prompt}"
    start_token = torch.tensor([[128259]], dtype=torch.int64)
    end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)

    input_ids = tokenizer(full_prompt, return_tensors="pt").input_ids
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)

    input_ids = modified_input_ids.to("cuda")
    attention_mask = torch.ones_like(input_ids)

    generated_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=4000,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.1,
        num_return_sequences=1,
        eos_token_id=128258,
        use_cache=True,
    )

    token_to_find = 128257
    token_to_remove = 128258

    token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

    if len(token_indices[1]) > 0:
        last_occurrence_idx = token_indices[1][-1].item()
        cropped_tensor = generated_ids[:, last_occurrence_idx + 1 :]
    else:
        cropped_tensor = generated_ids

    masked_row = cropped_tensor[0][cropped_tensor[0] != token_to_remove]
    row_length = masked_row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = masked_row[:new_length]
    code_list = [t - 128266 for t in trimmed_row]

    return code_list


def redistribute_codes(code_list):
    layer_1 = []
    layer_2 = []
    layer_3 = []
    for i in range((len(code_list) + 1) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))

    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0),
    ]
    codes = [c.to("cuda") for c in codes]

    audio_hat = snac_model.decode(codes)
    return audio_hat


for i, prompt in enumerate(prompts):
    print(f"Processing prompt {i + 1}/{len(prompts)}")
    with torch.no_grad():
        code_list = process_single_prompt(prompt, chosen_voice)
        samples = redistribute_codes(code_list)


    audio_numpy = samples.detach().squeeze().to("cpu").numpy()
    sf.write(f"output_{i}.wav", audio_numpy, 24000)
    print(f"Saved output_{i}.wav")

SebastianBodza
/

Kartoffel_Orpheus-3B_german_synthetic-v0.1

You need to agree to share your contact information to access this model