
Kartoffel-3B (Based on Orpheus-3B) - Synthetic
Model Overview
This is a German text-to-speech (TTS) model family based on Orpheus-3B.
Two main versions are available:
- Kartoffel-3B-Natural: Fine-tuned primarily on natural human speech recordings, aiming for realistic voices. The dataset is based on high-quality German audio, including permissively licensed podcasts, lectures, and other OER data, processed with an Emilia-style pipeline.
- Kartoffel-3B-Synthetic: Fine-tuned on synthetic speech data with emotions and expressive outbursts. The dataset covers a diverse set of emotions across 4 different speakers.
This repository contains the synthetic version: the speakers sound synthetic, but the model adds support for emotions and outbursts.
Both versions support:
- Multiple Speakers: The model can generate speech using various speaker identities from a set of predefined speakers.
- Varied Expressions: Capable of generating speech with different emotional tones and expressions based on the input text.
Available Speakers & Expressions for the Synthetic Version:
Speakers:
- Martin
- Luca
- Anne
- Emma
Emotions:
The following emotions are supported:
- Neutral
- Happy
- Sad
- Excited
- Surprised
- Humorous
- Angry
- Calm
- Disgust
- Fear
- Proud
- Romantic
To use an emotion, append it after the speaker name using the template [Speaker_name] - [Emotion]: [German text].
For example, for the speaker Martin and the emotion Sad, the correct prompt is:
Martin - Sad: Oh ich bin sooo traurig.
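For illustration, the snippet below is a minimal sketch of how such prompts can be assembled programmatically. The build_prompt helper is hypothetical and not part of the model or its tooling; it simply reproduces the template above.

```python
# Hypothetical helper that assembles a "[Speaker] - [Emotion]: [text]" prompt string.
def build_prompt(speaker: str, text: str, emotion: str | None = None) -> str:
    prefix = f"{speaker} - {emotion}" if emotion else speaker
    return f"{prefix}: {text}"

print(build_prompt("Martin", "Oh ich bin sooo traurig.", emotion="Sad"))
# Martin - Sad: Oh ich bin sooo traurig.
```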
Outbursts:
The following outbursts are supported:
- haha
- ughh
- wow
- wuhuuu
- ohhh
You can either use them directly in the text or place them in tags. Make sure to use the exact spelling listed above.
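As a rough sketch, the prompts below show how outbursts might be combined with a speaker and an emotion. The sentences are made up for illustration, and the angle-bracket tag form is only an assumption based on the note above about placing outbursts in tags.

```python
# Illustrative prompts combining speakers, emotions, and outbursts.
example_prompts = [
    "Martin - Happy: haha Das war wirklich ein toller Abend, wow!",
    "Anne - Excited: <wuhuuu> Wir haben das Finale gewonnen!",  # tag form (assumed syntax)
]
```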
Inference
```python
import torch
import soundfile as sf
from snac import SNAC
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the TTS model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    "SebastianBodza/Kartoffel_Orpheus-3B_german_synthetic-v0.1",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "SebastianBodza/Kartoffel_Orpheus-3B_german_synthetic-v0.1",
)

# Load the SNAC codec used to decode the generated audio tokens (24 kHz).
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")

chosen_voice = "Martin"  # one of: Martin, Luca, Anne, Emma
prompts = [
    "Tief im verwunschenen Wald, wo die Bäume uralte Geheimnisse flüsterten, lebte ein kleiner Gnom namens Fips, der die Sprache der Tiere verstand.",
]


def process_single_prompt(prompt, chosen_voice):
    # Prepend the speaker name unless it is already contained in the prompt.
    if chosen_voice == "in_prompt" or chosen_voice == "":
        full_prompt = prompt
    else:
        full_prompt = f"{chosen_voice}: {prompt}"

    # Wrap the text in the special start/end tokens the model expects.
    start_token = torch.tensor([[128259]], dtype=torch.int64)
    end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)
    input_ids = tokenizer(full_prompt, return_tensors="pt").input_ids
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
    input_ids = modified_input_ids.to("cuda")
    attention_mask = torch.ones_like(input_ids)

    generated_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=4000,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.1,
        num_return_sequences=1,
        eos_token_id=128258,
        use_cache=True,
    )

    # Keep only the tokens after the last start-of-audio marker (128257)
    # and drop the end-of-audio token (128258).
    token_to_find = 128257
    token_to_remove = 128258
    token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
    if len(token_indices[1]) > 0:
        last_occurrence_idx = token_indices[1][-1].item()
        cropped_tensor = generated_ids[:, last_occurrence_idx + 1 :]
    else:
        cropped_tensor = generated_ids
    masked_row = cropped_tensor[0][cropped_tensor[0] != token_to_remove]

    # SNAC expects groups of 7 codes, so trim to a multiple of 7 and
    # shift the token IDs back into the codec's code range.
    row_length = masked_row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = masked_row[:new_length]
    code_list = [t - 128266 for t in trimmed_row]
    return code_list


def redistribute_codes(code_list):
    # Split each group of 7 codes into the three SNAC codebook layers.
    layer_1 = []
    layer_2 = []
    layer_3 = []
    for i in range((len(code_list) + 1) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))
    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0),
    ]
    codes = [c.to("cuda") for c in codes]
    # Decode the codes back into a 24 kHz waveform.
    audio_hat = snac_model.decode(codes)
    return audio_hat


for i, prompt in enumerate(prompts):
    print(f"Processing prompt {i + 1}/{len(prompts)}")
    with torch.no_grad():
        code_list = process_single_prompt(prompt, chosen_voice)
        samples = redistribute_codes(code_list)
    audio_numpy = samples.detach().squeeze().to("cpu").numpy()
    sf.write(f"output_{i}.wav", audio_numpy, 24000)
    print(f"Saved output_{i}.wav")
```
Model tree for SebastianBodza/Kartoffel_Orpheus-3B_german_synthetic-v0.1
Base model: meta-llama/Llama-3.2-3B-Instruct