Vodex-Zen Multispeaker TTS
Model Details
Model Description
Vodex-Zen Multispeaker TTS is a multi-speaker expressive text-to-speech model powered by Vodex and designed for speaker-conditioned, emotionally expressive speech generation. The model uses the SNAC (Multi-Scale Neural Audio Codec) to decode generated audio tokens into high-quality, natural-sounding speech with emotional expressiveness across multiple speaker voices.
The model supports various expressive tags such as <laugh>, <sigh>, <chuckle>, and other emotional markers, enabling the generation of contextually appropriate and emotionally rich speech output. It has been specifically trained on three custom speaker voices: Ankita, Shweta, and Astha, providing users with diverse vocal characteristics and speaking styles.
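For reference, speaker conditioning is applied by prefixing the prompt with the speaker name (as in the usage example further below), and expressive tags are written inline in the text. The following is a minimal sketch of prompt construction; the build_prompt helper is illustrative rather than part of the model's API, and tags beyond those listed above are not guaranteed to be supported.

# Illustrative prompt construction: "<speaker>: <text with inline expressive tags>".
# build_prompt is a convenience helper for this sketch, not part of the model's API.
SPEAKERS = ["Ankita", "Shweta", "Astha"]           # the three trained voices
KNOWN_TAGS = ["<laugh>", "<sigh>", "<chuckle>"]    # documented expressive tags

def build_prompt(speaker: str, text: str) -> str:
    if speaker not in SPEAKERS:
        raise ValueError(f"Unknown speaker: {speaker}")
    return f"{speaker}: {text}"

print(build_prompt("Shweta", "We finally finished the project <laugh> what a relief!"))
# -> Shweta: We finally finished the project <laugh> what a relief!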
- Developed by: athenasaurav
- Model type: Text-to-Speech (TTS) with multi-speaker support
- Language(s): English
- License: CC BY-NC 4.0 (Non-commercial use only)
- Finetuned from model: athenasaurav/Zen-TTS-v0.1-pretrained (Not Open Sourced)
Model Sources
- Repository: athenasaurav/Vodex-zen-multispeaker-en
- Base Model: athenasaurav/Zen-TTS-v0.1-pretrained
- Framework: Hugging Face Transformers with a Llama-style backbone and SNAC decoder
Uses
Direct Use
This model is designed for research and development purposes in the field of text-to-speech synthesis. It can be used to generate expressive speech from text input with speaker conditioning, making it suitable for applications requiring natural-sounding, emotionally expressive synthetic speech.
Downstream Use
The model can serve as a foundation for various speech synthesis applications, including:
- Voice assistants with emotional expression capabilities
- Audiobook narration with multiple character voices
- Educational content with engaging speech synthesis
- Accessibility tools for text-to-speech conversion
- Research in expressive speech synthesis
Out-of-Scope Use
This model is licensed for non-commercial research use only. Commercial applications, production deployments, or any use that generates revenue is explicitly prohibited under the CC BY-NC 4.0 license. Additionally, the model should not be used for creating misleading or deceptive audio content, impersonating real individuals without consent, or any malicious purposes.
Bias, Risks, and Limitations
Limitations
The model has several inherent limitations that users should be aware of:
- Language Limitation: The model is trained exclusively on English text and speech data, limiting its applicability to English-language content only.
- Speaker Limitation: The model supports only three predefined speakers (Ankita, Shweta, Astha), which may not represent the full diversity of human voices and speaking styles.
- Expressive Tag Dependency: The model's emotional expressiveness relies on specific tags, which may not cover all possible emotional states or expressions.
- Quality Variability: Speech quality may vary depending on input text complexity, length, and the presence of out-of-vocabulary words or unusual linguistic constructions.
Risks
Users should consider the following risks when working with this model:
- Misuse for Deception: The high-quality speech synthesis capabilities could potentially be misused to create misleading audio content or impersonate individuals.
- Bias in Training Data: The model may reflect biases present in the training data, potentially affecting the naturalness or appropriateness of generated speech for certain demographic groups or contexts.
- Technical Dependencies: The model requires specific technical infrastructure and dependencies (SNAC decoder, CUDA support), which may limit accessibility and increase implementation complexity.
Recommendations
To mitigate these risks and limitations:
- Always disclose when audio content is synthetically generated
- Implement appropriate safeguards against misuse in applications
- Test the model thoroughly with diverse input texts before deployment
- Consider the ethical implications of synthetic speech generation in your use case
- Ensure compliance with relevant regulations and guidelines for synthetic media
How to Get Started with the Model
Installation
Before using the model, install the required dependencies:
pip install unsloth
pip install snac
pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
pip install sentencepiece protobuf huggingface_hub hf_transfer
pip install --no-deps unsloth
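Before running the usage example, a quick sanity check can confirm that the core dependencies import correctly and whether a CUDA GPU is visible (the example below moves its inputs to "cuda", so a GPU is effectively required as written). This is a minimal, illustrative check rather than part of the model's tooling.

# Minimal environment check (illustrative only).
import torch
from snac import SNAC
from transformers import AutoTokenizer, AutoModelForCausalLM

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # the usage example sends inputs to "cuda"

# Loading the 24 kHz SNAC decoder up front also verifies the snac installation.
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
print("SNAC decoder loaded:", type(snac_model).__name__)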
Basic Usage
import os
import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM
from snac import SNAC
from IPython.display import Audio, display
# ===============================
# Setup config
# ===============================
model_repo = "athenasaurav/Vodex-zen-multispeaker-en"
hf_token = "YOUR_HF_TOKEN" ## Replace this with your HF token
output_dir = "outputs1"
chosen_voice = "Ankita" # Or "Shweta" # Or "Astha"
prompts = [
    "Seeing my daughter perform on stage <laugh> filled my heart with incredible pride!"
]
# ===============================
# Load model & tokenizer
# ===============================
tokenizer = AutoTokenizer.from_pretrained(model_repo, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(
    model_repo,
    device_map="auto",
    torch_dtype="auto",
    token=hf_token,
    load_in_4bit=False
)
# ===============================
# Load SNAC decoder
# ===============================
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model.to("cpu")
# ===============================
# Prompt preprocessing
# ===============================
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]
start_token = torch.tensor([[128259]], dtype=torch.int64)         # control token marking the start of the prompt
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)  # control tokens marking the end of the prompt
padding_token = 128263                                            # id used for left-padding shorter prompts
all_input_ids = []
for prompt in prompts_:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
    all_input_ids.append(modified_input_ids)
# Pad to same length
max_length = max([ids.shape[1] for ids in all_input_ids])
padded_tensors, attention_masks = [], []
for ids in all_input_ids:
    padding = max_length - ids.shape[1]
    padded = torch.cat([torch.full((1, padding), padding_token, dtype=torch.int64), ids], dim=1)
    mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones_like(ids)], dim=1)
    padded_tensors.append(padded)
    attention_masks.append(mask)
input_ids = torch.cat(padded_tensors).to("cuda")
attention_mask = torch.cat(attention_masks).to("cuda")
# ===============================
# Inference
# ===============================
generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=4000,
    do_sample=True,
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=128258,
)
# ===============================
# Postprocess audio tokens
# ===============================
token_to_find = 128257    # control token that precedes the audio token stream
token_to_remove = 128258  # end-of-speech token (also used as eos_token_id above)
# Find where to start audio decoding
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
if len(token_indices[1]) > 0:
    last_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_idx + 1:]
else:
    cropped_tensor = generated_ids
# Remove eos token
processed_rows = [row[row != token_to_remove] for row in cropped_tensor]
# ===============================
# SNAC decode helper
# ===============================
def redistribute_codes(code_list):
    # Each group of 7 consecutive tokens encodes one audio frame; the tokens are
    # de-interleaved into SNAC's three codebook levels after removing per-position offsets.
    layer_1, layer_2, layer_3 = [], [], []
    for i in range((len(code_list)+1)//7):
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i+1] - 4096)
        layer_3.append(code_list[7*i+2] - 8192)
        layer_3.append(code_list[7*i+3] - 12288)
        layer_2.append(code_list[7*i+4] - 16384)
        layer_3.append(code_list[7*i+5] - 20480)
        layer_3.append(code_list[7*i+6] - 24576)
    codes = [torch.tensor(layer_1).unsqueeze(0),
             torch.tensor(layer_2).unsqueeze(0),
             torch.tensor(layer_3).unsqueeze(0)]
    return snac_model.decode(codes)
# ===============================
# Decode audio and play/save
# ===============================
for i, row in enumerate(processed_rows):
    # Trim to a whole number of 7-token frames and shift token ids into codec-code space
    row = row[: (len(row) // 7) * 7]
    row = [token.item() - 128266 for token in row]
    samples = redistribute_codes(row)
    audio_np = samples.detach().squeeze().cpu().numpy()
    # Save audio
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    out_path = f"{output_dir}/{chosen_voice}_{i}.wav"
    sf.write(out_path, audio_np, samplerate=24000)
    print(f"Saved: {out_path}")
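Note that generation runs on the GPU while the SNAC decoder is kept on the CPU (snac_model.to("cpu")); this avoids extra GPU memory use at the cost of slower decoding, and the decoder can be moved to the GPU as long as the code tensors are moved as well. The decoded waveforms are written to the outputs1 directory as 24 kHz WAV files named <speaker>_<index>.wav.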
Training Details
Training Data
The model was fine-tuned from the base model athenasaurav/Zen-TTS-v0.1-pretrained using multi-speaker English speech data. The training dataset included recordings from three speakers (Ankita, Shweta, Astha) with various emotional expressions and speaking styles to enable expressive speech synthesis.
Training Procedure
The model utilizes the Unsloth framework for efficient training and is built upon the Transformers architecture. The training process involved fine-tuning the base model to learn speaker-specific characteristics while maintaining the ability to generate expressive speech through special tokens and tags.
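The exact training configuration has not been published, so the snippet below is only a rough sketch of what a LoRA fine-tune with Unsloth typically looks like for a model of this kind. The sequence length, LoRA rank, and target modules are assumptions, and the base checkpoint referenced is not open sourced.

# Hedged sketch of an Unsloth LoRA fine-tune; all hyperparameters are illustrative.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="athenasaurav/Zen-TTS-v0.1-pretrained",  # base model (not publicly available)
    max_seq_length=4096,                                # assumption
    load_in_4bit=False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,            # assumption: LoRA rank
    lora_alpha=64,   # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # typical Llama projection layers
)

# Training would then proceed with a standard causal-LM trainer (e.g. trl's SFTTrainer)
# over speaker-prefixed transcripts paired with their SNAC audio-token sequences.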
Training Hyperparameters
- Training regime: Fine-tuning from pre-trained checkpoint
- Framework: Unsloth with Transformers backend
- Audio codec: SNAC (Multi-Scale Neural Audio Codec) at 24 kHz
- Supported speakers: 3 (Ankita, Shweta, Astha)
Speeds, Sizes, Times
- Model size: Varies based on configuration (specific size not provided)
- Inference speed: Depends on hardware configuration and sequence length
- Audio sample rate: 24kHz output
Evaluation
Testing Data, Factors & Metrics
Specific evaluation metrics and testing procedures are not detailed in the provided information. Users are encouraged to evaluate the model's performance on their specific use cases and datasets.
Results
The model demonstrates capability in generating expressive multi-speaker speech with emotional tags. Qualitative assessment shows the model can produce natural-sounding speech with appropriate emotional expression when provided with suitable input prompts and expressive tags.
Technical Specifications
Model Architecture and Objective
The model is based on a causal language modeling approach adapted for speech synthesis, utilizing the SNAC audio codec for high-quality audio generation. It employs a transformer-based architecture fine-tuned for multi-speaker text-to-speech synthesis.
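Concretely, as implied by the redistribute_codes helper in the usage example above, each audio frame is represented by seven consecutive generated tokens: after subtracting the audio-token offset (128266), the seven positions are de-interleaved into SNAC's three codebook levels, with a fixed per-position offset removed. The sketch below illustrates that frame layout; the helper name here is illustrative.

# De-interleaving one 7-token audio frame into SNAC's three codebook levels,
# mirroring the offsets used by redistribute_codes in the usage example.
FRAME_OFFSETS = [0, 4096, 8192, 12288, 16384, 20480, 24576]  # per-position offsets
FRAME_LEVELS  = [1, 2, 3, 3, 2, 3, 3]                        # destination codebook level

def split_frame(frame_tokens):
    """frame_tokens: seven codec-space ids (LM token id minus 128266)."""
    levels = {1: [], 2: [], 3: []}
    for pos, tok in enumerate(frame_tokens):
        levels[FRAME_LEVELS[pos]].append(tok - FRAME_OFFSETS[pos])
    return levels

# One synthetic frame: each position carries its offset plus a small code value.
print(split_frame([10, 4096 + 20, 8192 + 30, 12288 + 40, 16384 + 50, 20480 + 60, 24576 + 70]))
# -> {1: [10], 2: [20, 50], 3: [30, 40, 60, 70]}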
Compute Infrastructure
Hardware
- Inference: CUDA-compatible GPU recommended for optimal performance
Software
- Framework: Hugging Face Transformers
- Training library: Unsloth
- Audio codec: SNAC
- Dependencies: PyTorch, various supporting libraries as listed in installation requirements
Citation
If you use this model in your research or applications, please cite:
@misc{vodex-zen-multispeaker-2024,
  author = {athenasaurav},
  title = {Vodex-Zen Multispeaker TTS (SNAC-based)},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/athenasaurav/Vodex-zen-multispeaker-en}},
  note = {Multi-speaker expressive text-to-speech model}
}
Model Card Authors
This model card was created by athenasaurav and formatted according to Hugging Face model card standards.
Model Card Contact
For questions, issues, or feedback regarding this model, please contact the model author through the Hugging Face model repository or relevant communication channels provided by the developer.