This model is a fine-tuned version of sesame/csm-1b, trained on the Elise dataset with LoRA. Sample output files are included in the repository.
The sound quality appears better than full-parameter fine-tuning, but more tweaking would be needed for consistent performance. In the samples, two distinct voices (soft and vibrant) can be heard depending on how the model is prompted. Performance on longer token sequences also needs further validation.
A larger training dataset would be required for more consistent voice output, as the current dataset is small and limited.
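The adapter is a PEFT LoRA adapter (it is loaded with PeftModel in the usage example below). For context, the sketch that follows shows how such an adapter is typically attached to the base model before training; the rank, alpha, dropout, and target modules are illustrative assumptions, not the exact settings used for this checkpoint. The inference example after it loads the published adapter on top of the base model.
import torch
from transformers import CsmForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base model in half precision
base = CsmForConditionalGeneration.from_pretrained(
    "sesame/csm-1b",
    device_map="cuda",
    torch_dtype=torch.float16,
)

# Illustrative LoRA settings; the actual values used for this checkpoint may differ
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Wrap the base model so only the low-rank adapter weights are trainable
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()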
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from peft import PeftModel
import soundfile as sf
from IPython.display import Audio, display
# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load model and processor
base_model_id = "sesame/csm-1b"
adapter_model_id = "keanteng/sesame-csm-elise-lora"  # LoRA adapter fine-tuned on the Elise dataset
# Load processor
processor = AutoProcessor.from_pretrained(base_model_id)
# Load base model
base_model = CsmForConditionalGeneration.from_pretrained(
    base_model_id,
    device_map=device,
    torch_dtype=torch.float16,  # Use half precision for faster inference
)
# Load adapter and merge weights
model = PeftModel.from_pretrained(base_model, adapter_model_id)
model = model.merge_and_unload() # Merge adapter weights into base model
# Optimize for generation
model.generation_config.max_length = 256
model.generation_config.use_cache = True
model.generation_config.cache_implementation = "static"
if hasattr(model, "depth_decoder"):
    model.depth_decoder.generation_config.cache_implementation = "static"
# Define a simple input
conversation = [
    {"role": "0", "content": [
        {"type": "text", "text": "Hello! I'm so happy to see you today!"}
    ]},
]
# Process input
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)
# Generate audio
audio = model.generate(**inputs, output_audio=True)
# Convert to numpy and save
audio_cpu = audio[0].to(torch.float32).cpu().numpy()
output_file = "output.wav"
sf.write(output_file, audio_cpu, 24000)
# Play audio if in notebook
try:
    display(Audio(output_file))
except Exception:
    print(f"Audio saved to {output_file}")
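As a quick way to hear the soft versus vibrant voice behaviour mentioned above, the same model and processor from the snippet above can be looped over a few prompts; the prompt texts here are only illustrative.
# Reuses `model`, `processor`, `device`, and `sf` from the snippet above
prompts = [
    "I had such a calm, quiet afternoon reading by the window.",
    "We actually won the whole tournament, I can't believe it!",
]
for i, text in enumerate(prompts):
    conv = [{"role": "0", "content": [{"type": "text", "text": text}]}]
    batch = processor.apply_chat_template(
        conv,
        tokenize=True,
        return_dict=True,
    ).to(device)
    out = model.generate(**batch, output_audio=True)
    # Save each generated clip to its own 24 kHz WAV file
    sf.write(f"output_{i}.wav", out[0].to(torch.float32).cpu().numpy(), 24000)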