
Kartoffel-3B (Based on Orpheus-3B) - Natural
Model Overview
This is a German text-to-speech (TTS) model family based on Orpheus-3B.
Two main versions are available:
- Kartoffel-3B-Natural: Fine-tuned primarily on natural human speech recordings, aiming for realistic voices. The dataset is based on high-quality German audio, including permissively licensed podcasts, lectures, and other OER data, processed with an Emilia-style pipeline.
- Kartoffel-3B-Synthetic: Fine-tuned on synthetic speech data that includes emotions and various outbursts. The dataset covers a diverse set of emotions with 4 different speakers.
This repository contains the Natural version, which targets natural-sounding speakers.
Both versions support:
- Multiple Speakers: The model can generate speech using various speaker identities from predefined speakers.
- Varied Expressions: Capable of generating speech with different emotional tones and expressions based on the input text. The natural version has limited support for expressions and emotions.
Available Speakers & Expressions for the Natural Version:
Speakers:
Several speakers are available, but not all of them are stable. I therefore list only the speakers that are at least partially stable:
Jakob
Anton
Julian
Jan
Alexander
Emil
Ben
Elias
Felix
Jonas
Noah
Maximilian
Sophie
Marie
Mia
Maria
Sophia
Lina
Lea
Unfortunately, the dataset contained a lot more male than female speakers. Also, not all speakers could be reconstructed, and duplicates may be present. The gender estimation also worked rather poorly.
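The inference script on this page selects a speaker by prefixing the prompt with the speaker name (`f"{chosen_voice}: {prompt}"`). A minimal sketch of a helper that checks a name against the list above before building that prompt string follows. `STABLE_SPEAKERS` and `build_prompt` are hypothetical names introduced here for illustration; they are not part of the released code or of any library.

```python
# Hypothetical helper, not part of the released code.
# The names below are copied from the speaker list on this page.
STABLE_SPEAKERS = {
    "Jakob", "Anton", "Julian", "Jan", "Alexander", "Emil", "Ben",
    "Elias", "Felix", "Jonas", "Noah", "Maximilian", "Sophie", "Marie",
    "Mia", "Maria", "Sophia", "Lina", "Lea",
}


def build_prompt(text: str, voice: str = "") -> str:
    """Return the prompt in the "<Speaker>: <text>" form used by the script."""
    # An empty voice or "in_prompt" means the text is passed through unchanged.
    if voice in ("", "in_prompt"):
        return text
    if voice not in STABLE_SPEAKERS:
        raise ValueError(f"Unknown or unstable speaker: {voice!r}")
    return f"{voice}: {text}"


print(build_prompt("Guten Morgen!", "Julian"))
```

Rejecting unknown names early is a design choice of this sketch; the model itself accepts any prefix, but names outside the list may produce unstable voices.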
import soundfile as sf
import torch
from snac import SNAC
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Natural checkpoint (the repository this model card belongs to).
model = AutoModelForCausalLM.from_pretrained(
    "SebastianBodza/Kartoffel_Orpheus-3B_german_natural-v0.1",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "SebastianBodza/Kartoffel_Orpheus-3B_german_natural-v0.1",
)

# SNAC decoder that turns the generated audio codes into a 24 kHz waveform.
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")

chosen_voice = "Julian"
prompts = [
    'Tief im verwunschenen Wald, wo die Bäume uralte Geheimnisse flüsterten, lebte ein kleiner Gnom namens Fips, der die Sprache der Tiere verstand.',
]


def process_single_prompt(prompt, chosen_voice):
    # Prefix the prompt with the speaker name unless the voice is already in the prompt.
    if chosen_voice == "in_prompt" or chosen_voice == "":
        full_prompt = prompt
    else:
        full_prompt = f"{chosen_voice}: {prompt}"

    # Wrap the tokenized text in the special start and end tokens.
    start_token = torch.tensor([[128259]], dtype=torch.int64)
    end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)
    input_ids = tokenizer(full_prompt, return_tensors="pt").input_ids
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
    input_ids = modified_input_ids.to("cuda")
    attention_mask = torch.ones_like(input_ids)

    generated_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=4000,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.1,
        num_return_sequences=1,
        eos_token_id=128258,
        use_cache=True,
    )

    # Keep only the tokens after the last occurrence of token 128257.
    token_to_find = 128257
    token_to_remove = 128258
    token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
    if len(token_indices[1]) > 0:
        last_occurrence_idx = token_indices[1][-1].item()
        cropped_tensor = generated_ids[:, last_occurrence_idx + 1 :]
    else:
        cropped_tensor = generated_ids

    # Drop token 128258 (the EOS token above), trim to whole 7-token frames,
    # and subtract the 128266 offset to obtain the audio codes.
    masked_row = cropped_tensor[0][cropped_tensor[0] != token_to_remove]
    row_length = masked_row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = masked_row[:new_length]
    code_list = [t - 128266 for t in trimmed_row]
    return code_list


def redistribute_codes(code_list):
    # Each 7-code frame holds 1 code for layer 1, 2 for layer 2, and 4 for layer 3.
    # The code at position k in a frame carries an offset of k * 4096.
    layer_1 = []
    layer_2 = []
    layer_3 = []
    # code_list is already trimmed to a multiple of 7, so this covers every frame.
    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))
    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0),
    ]
    codes = [c.to("cuda") for c in codes]
    audio_hat = snac_model.decode(codes)
    return audio_hat


for i, prompt in enumerate(prompts):
    print(f"Processing prompt {i + 1}/{len(prompts)}")
    with torch.no_grad():
        code_list = process_single_prompt(prompt, chosen_voice)
        samples = redistribute_codes(code_list)
    audio_numpy = samples.detach().squeeze().to("cpu").numpy()
    sf.write(f"output_{i}.wav", audio_numpy, 24000)
    print(f"Saved output_{i}.wav")
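
To make the token handling in `process_single_prompt` and `redistribute_codes` easier to follow, here is a small torch-free sketch of the same steps on plain Python integers. It assumes only the constants that appear in the script above (128257, 128258, 128266, and the per-position offset of 4096); `extract_codes` and `split_frames` are hypothetical names used for this illustration and do not exist in any library.

```python
FRAME = 7             # codes per frame, as in the script above
CODEBOOK_SIZE = 4096  # per-position offset used in redistribute_codes


def extract_codes(ids, token_to_find=128257, token_to_remove=128258, offset=128266):
    """Crop, filter, and de-offset generated token ids (list of ints)."""
    # Keep only what follows the last occurrence of token_to_find, if present.
    if token_to_find in ids:
        last = len(ids) - 1 - ids[::-1].index(token_to_find)
        ids = ids[last + 1:]
    # Remove every occurrence of token_to_remove.
    ids = [t for t in ids if t != token_to_remove]
    # Trim to a whole number of frames, then subtract the offset.
    ids = ids[: (len(ids) // FRAME) * FRAME]
    return [t - offset for t in ids]


def split_frames(codes):
    """De-interleave 7-code frames into three layers (1, 2, and 4 codes per frame)."""
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(codes) // FRAME):
        frame = codes[FRAME * i: FRAME * (i + 1)]
        # Position k in the frame carries an offset of k * CODEBOOK_SIZE.
        c = [frame[k] - k * CODEBOOK_SIZE for k in range(FRAME)]
        layer_1.append(c[0])
        layer_2.extend([c[1], c[4]])
        layer_3.extend([c[2], c[3], c[5], c[6]])
    return layer_1, layer_2, layer_3


# Toy input: one frame whose de-offset codes are 10..16, wrapped the way
# the generated sequence is wrapped in the script above.
frame = [10 + k + k * CODEBOOK_SIZE for k in range(FRAME)]
generated = [1, 2, 128257] + [128266 + v for v in frame] + [128258]

codes = extract_codes(generated)
print(split_frames(codes))
```

The three lists have the same layout as `layer_1`, `layer_2`, and `layer_3` in the script; there they are converted to tensors and passed to `snac_model.decode`.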
Model tree for SebastianBodza/Kartoffel_Orpheus-3B_german_natural-v0.1
Base model
meta-llama/Llama-3.2-3B-Instruct