Vodex-Zen Multispeaker TTS
Model Details
Model Description
Vodex-Zen Multispeaker TTS is a multi-speaker expressive text-to-speech model powered by Vodex and designed for speaker-conditioned, emotionally expressive speech generation. The model uses the SNAC (Multi-Scale Neural Audio Codec) to decode generated audio tokens into high-quality, natural-sounding speech with emotional expressiveness across multiple speaker voices.
The model supports various expressive tags such as <laugh>, <sigh>, <chuckle>, and other emotional markers, enabling the generation of contextually appropriate and emotionally rich speech output. It has been specifically trained on three custom speaker voices: Ankita, Shweta, and Astha, providing users with diverse vocal characteristics and speaking styles.
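For reference, speaker conditioning is applied by prefixing the prompt with the speaker name (as in the usage example further below), and expressive tags are written inline in the text. The following is a minimal sketch of prompt construction; the build_prompt helper is illustrative rather than part of the model's API, and tags beyond those listed above are not guaranteed to be supported.

# Illustrative prompt construction: "<speaker>: <text with inline expressive tags>".
# build_prompt is a convenience helper for this sketch, not part of the model's API.
SPEAKERS = ["Ankita", "Shweta", "Astha"]           # the three trained voices
KNOWN_TAGS = ["<laugh>", "<sigh>", "<chuckle>"]    # documented expressive tags

def build_prompt(speaker: str, text: str) -> str:
    if speaker not in SPEAKERS:
        raise ValueError(f"Unknown speaker: {speaker}")
    return f"{speaker}: {text}"

print(build_prompt("Shweta", "We finally finished the project <laugh> what a relief!"))
# -> Shweta: We finally finished the project <laugh> what a relief!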
- Developed by: athenasaurav
- Model type: Text-to-Speech (TTS) with multi-speaker support
- Language(s): English
- License: CC BY-NC 4.0 (Non-commercial use only)
- Finetuned from model: athenasaurav/Zen-TTS-v0.1-pretrained (Not Open Sourced)
Model Sources
- Repository: athenasaurav/Vodex-zen-multispeaker-en
- Base Model: athenasaurav/Zen-TTS-v0.1-pretrained
- Framework: Hugging Face Transformers with a Llama-style backbone and SNAC decoder
Uses
Direct Use
This model is designed for research and development purposes in the field of text-to-speech synthesis. It can be used to generate expressive speech from text input with speaker conditioning, making it suitable for applications requiring natural-sounding, emotionally expressive synthetic speech.
Downstream Use
The model can serve as a foundation for various speech synthesis applications, including:
- Voice assistants with emotional expression capabilities
- Audiobook narration with multiple character voices
- Educational content with engaging speech synthesis
- Accessibility tools for text-to-speech conversion
- Research in expressive speech synthesis
Out-of-Scope Use
This model is licensed for non-commercial research use only. Commercial applications, production deployments, or any use that generates revenue is explicitly prohibited under the CC BY-NC 4.0 license. Additionally, the model should not be used for creating misleading or deceptive audio content, impersonating real individuals without consent, or any malicious purposes.
Bias, Risks, and Limitations
Limitations
The model has several inherent limitations that users should be aware of:
- Language Limitation: The model is trained exclusively on English text and speech data, limiting its applicability to English-language content only.
- Speaker Limitation: The model supports only three predefined speakers (Ankita, Shweta, Astha), which may not represent the full diversity of human voices and speaking styles.
- Expressive Tag Dependency: The model's emotional expressiveness relies on specific tags, which may not cover all possible emotional states or expressions.
- Quality Variability: Speech quality may vary depending on input text complexity, length, and the presence of out-of-vocabulary words or unusual linguistic constructions.
Risks
Users should consider the following risks when working with this model:
- Misuse for Deception: The high-quality speech synthesis capabilities could potentially be misused to create misleading audio content or impersonate individuals.
- Bias in Training Data: The model may reflect biases present in the training data, potentially affecting the naturalness or appropriateness of generated speech for certain demographic groups or contexts.
- Technical Dependencies: The model requires specific technical infrastructure and dependencies (SNAC decoder, CUDA support), which may limit accessibility and increase implementation complexity.
Recommendations
To mitigate these risks and limitations:
- Always disclose when audio content is synthetically generated
- Implement appropriate safeguards against misuse in applications
- Test the model thoroughly with diverse input texts before deployment
- Consider the ethical implications of synthetic speech generation in your use case
- Ensure compliance with relevant regulations and guidelines for synthetic media
How to Get Started with the Model
Installation
Before using the model, install the required dependencies:
pip install unsloth
pip install snac
pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
pip install sentencepiece protobuf huggingface_hub hf_transfer
pip install --no-deps unsloth
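Before running the usage example, a quick sanity check can confirm that the core dependencies import correctly and whether a CUDA GPU is visible (the example below moves its inputs to "cuda", so a GPU is effectively required as written). This is a minimal, illustrative check rather than part of the model's tooling.

# Minimal environment check (illustrative only).
import torch
from snac import SNAC
from transformers import AutoTokenizer, AutoModelForCausalLM

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # the usage example sends inputs to "cuda"

# Loading the 24 kHz SNAC decoder up front also verifies the snac installation.
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
print("SNAC decoder loaded:", type(snac_model).__name__)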
Basic Usage
import os
import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM
from snac import SNAC
from IPython.display import Audio, display
# ===============================
# Setup config
# ===============================
model_repo = "athenasaurav/Vodex-zen-multispeaker-en"
hf_token = "YOUR_HF_TOKEN" ## Replace this with your HF token
output_dir = "outputs1"
chosen_voice = "Ankita" # Or "Shweta" # Or "Astha"
prompts = [
    "Seeing my daughter perform on stage <laugh> filled my heart with incredible pride!"
]
# ===============================
# Load model & tokenizer
# ===============================
tokenizer = AutoTokenizer.from_pretrained(model_repo, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(
    model_repo,
    device_map="auto",
    torch_dtype="auto",
    token=hf_token,
    load_in_4bit=False
)
# ===============================
# Load SNAC decoder
# ===============================
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model.to("cpu")
# ===============================
# Prompt preprocessing
# ===============================
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]
start_token = torch.tensor([[128259]], dtype=torch.int64)         # control token marking the start of the prompt
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)  # control tokens marking the end of the prompt
padding_token = 128263                                            # id used for left-padding shorter prompts
all_input_ids = []
for prompt in prompts_:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
    all_input_ids.append(modified_input_ids)
# Pad to same length
max_length = max([ids.shape[1] for ids in all_input_ids])
padded_tensors, attention_masks = [], []
for ids in all_input_ids:
    padding = max_length - ids.shape[1]
    padded = torch.cat([torch.full((1, padding), padding_token, dtype=torch.int64), ids], dim=1)
    mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones_like(ids)], dim=1)
    padded_tensors.append(padded)
    attention_masks.append(mask)
input_ids = torch.cat(padded_tensors).to("cuda")
attention_mask = torch.cat(attention_masks).to("cuda")
# ===============================
# Inference
# ===============================
generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=4000,
    do_sample=True,
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=128258,
)
# ===============================
# Postprocess audio tokens
# ===============================
token_to_find = 128257    # control token that precedes the audio token stream
token_to_remove = 128258  # end-of-speech token (also used as eos_token_id above)
# Find where to start audio decoding
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
if len(token_indices[1]) > 0:
    last_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_idx + 1:]
else:
    cropped_tensor = generated_ids
# Remove eos token
processed_rows = [row[row != token_to_remove] for row in cropped_tensor]
# ===============================
# SNAC decode helper
# ===============================
def redistribute_codes(code_list):
    # Each group of 7 consecutive tokens encodes one audio frame; the tokens are
    # de-interleaved into SNAC's three codebook levels after removing per-position offsets.
    layer_1, layer_2, layer_3 = [], [], []
    for i in range((len(code_list)+1)//7):
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i+1] - 4096)
        layer_3.append(code_list[7*i+2] - 8192)
        layer_3.append(code_list[7*i+3] - 12288)
        layer_2.append(code_list[7*i+4] - 16384)
        layer_3.append(code_list[7*i+5] - 20480)
        layer_3.append(code_list[7*i+6] - 24576)
    codes = [torch.tensor(layer_1).unsqueeze(0),
             torch.tensor(layer_2).unsqueeze(0),
             torch.tensor(layer_3).unsqueeze(0)]
    return snac_model.decode(codes)
# ===============================
# Decode audio and play/save
# ===============================
for i, row in enumerate(processed_rows):
    # Trim to a whole number of 7-token frames and shift token ids into codec-code space
    row = row[: (len(row) // 7) * 7]
    row = [token.item() - 128266 for token in row]
    samples = redistribute_codes(row)
    audio_np = samples.detach().squeeze().cpu().numpy()
    # Save audio
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    out_path = f"{output_dir}/{chosen_voice}_{i}.wav"
    sf.write(out_path, audio_np, samplerate=24000)
    print(f"Saved: {out_path}")
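Note that generation runs on the GPU while the SNAC decoder is kept on the CPU (snac_model.to("cpu")); this avoids extra GPU memory use at the cost of slower decoding, and the decoder can be moved to the GPU as long as the code tensors are moved as well. The decoded waveforms are written to the outputs1 directory as 24 kHz WAV files named <speaker>_<index>.wav.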
Training Details
Training Data
The model was fine-tuned from the base model athenasaurav/Zen-TTS-v0.1-pretrained using multi-speaker English speech data. The training dataset included recordings from three speakers (Ankita, Shweta, Astha) with various emotional expressions and speaking styles to enable expressive speech synthesis.
Training Procedure
The model utilizes the Unsloth framework for efficient training and is built upon the Transformers architecture. The training process involved fine-tuning the base model to learn speaker-specific characteristics while maintaining the ability to generate expressive speech through special tokens and tags.
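The exact training configuration has not been published, so the snippet below is only a rough sketch of what a LoRA fine-tune with Unsloth typically looks like for a model of this kind. The sequence length, LoRA rank, and target modules are assumptions, and the base checkpoint referenced is not open sourced.

# Hedged sketch of an Unsloth LoRA fine-tune; all hyperparameters are illustrative.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="athenasaurav/Zen-TTS-v0.1-pretrained",  # base model (not publicly available)
    max_seq_length=4096,                                # assumption
    load_in_4bit=False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,            # assumption: LoRA rank
    lora_alpha=64,   # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # typical Llama projection layers
)

# Training would then proceed with a standard causal-LM trainer (e.g. trl's SFTTrainer)
# over speaker-prefixed transcripts paired with their SNAC audio-token sequences.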
Training Hyperparameters
- Training regime: Fine-tuning from pre-trained checkpoint
- Framework: Unsloth with Transformers backend
- Audio codec: SNAC (Multi-Scale Neural Audio Codec) at 24 kHz
- Supported speakers: 3 (Ankita, Shweta, Astha)
Speeds, Sizes, Times
- Model size: Varies based on configuration (specific size not provided)
- Inference speed: Depends on hardware configuration and sequence length
- Audio sample rate: 24kHz output
Evaluation
Testing Data, Factors & Metrics
Specific evaluation metrics and testing procedures are not detailed in the provided information. Users are encouraged to evaluate the model's performance on their specific use cases and datasets.
Results
The model demonstrates capability in generating expressive multi-speaker speech with emotional tags. Qualitative assessment shows the model can produce natural-sounding speech with appropriate emotional expression when provided with suitable input prompts and expressive tags.
Technical Specifications
Model Architecture and Objective
The model is based on a causal language modeling approach adapted for speech synthesis, utilizing the SNAC audio codec for high-quality audio generation. It employs a transformer-based architecture fine-tuned for multi-speaker text-to-speech synthesis.
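Concretely, as implied by the redistribute_codes helper in the usage example above, each audio frame is represented by seven consecutive generated tokens: after subtracting the audio-token offset (128266), the seven positions are de-interleaved into SNAC's three codebook levels, with a fixed per-position offset removed. The sketch below illustrates that frame layout; the helper name here is illustrative.

# De-interleaving one 7-token audio frame into SNAC's three codebook levels,
# mirroring the offsets used by redistribute_codes in the usage example.
FRAME_OFFSETS = [0, 4096, 8192, 12288, 16384, 20480, 24576]  # per-position offsets
FRAME_LEVELS  = [1, 2, 3, 3, 2, 3, 3]                        # destination codebook level

def split_frame(frame_tokens):
    """frame_tokens: seven codec-space ids (LM token id minus 128266)."""
    levels = {1: [], 2: [], 3: []}
    for pos, tok in enumerate(frame_tokens):
        levels[FRAME_LEVELS[pos]].append(tok - FRAME_OFFSETS[pos])
    return levels

# One synthetic frame: each position carries its offset plus a small code value.
print(split_frame([10, 4096 + 20, 8192 + 30, 12288 + 40, 16384 + 50, 20480 + 60, 24576 + 70]))
# -> {1: [10], 2: [20, 50], 3: [30, 40, 60, 70]}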
Compute Infrastructure
Hardware
- Inference: CUDA-compatible GPU recommended for optimal performance
Software
- Framework: Hugging Face Transformers
- Training library: Unsloth
- Audio codec: SNAC
- Dependencies: PyTorch, various supporting libraries as listed in installation requirements
Citation
If you use this model in your research or applications, please cite:
@misc{vodex-zen-multispeaker-2024,
  author = {athenasaurav},
  title = {Vodex-Zen Multispeaker TTS (SNAC-based)},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/athenasaurav/Vodex-zen-multispeaker-en}},
  note = {Multi-speaker expressive text-to-speech model}
}
Model Card Authors
This model card was created by athenasaurav and formatted according to Hugging Face model card standards.
Model Card Contact
For questions, issues, or feedback regarding this model, please contact the model author through the Hugging Face model repository or relevant communication channels provided by the developer.