CSM Elise Voice Model LoRA

This model is a fine-tuned version of sesame/csm-1b on the Elise dataset using LoRA. Sample output files are included in the repository.

The sound quality appears better than full-parameter fine-tuning, although more tweaking is needed to ensure consistent performance. In the samples, the model produces two distinct voices (soft and vibrant) depending on how it is prompted, and performance on longer token sequences remains to be validated.

A larger training dataset would be required for more consistent voice output, as the current dataset is small and limited.

Model Details

  • Base Model: sesame/csm-1b
  • Training Data: MrDragonFox/Elise dataset
  • Fine-tuning Approach: Voice cloning through conditional speech generation using LoRA
  • Voice Characteristics: [Describe voice qualities]
  • Training Parameters (see the sketch after this list):
    • Learning Rate: 1e-5
    • Epochs: 4
    • Batch Size: 1 with gradient accumulation steps of 4
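
A rough reconstruction of this setup with peft and transformers is sketched below. This is a minimal sketch, not the exact training script: the LoRA rank, alpha, dropout, and target modules are assumptions, and mapping the raw dataset into CSM input features (audio plus transcript) is omitted.

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import CsmForConditionalGeneration, TrainingArguments

# Wrap the base model with a LoRA adapter (rank/alpha/targets are assumed values)
base_model = CsmForConditionalGeneration.from_pretrained("sesame/csm-1b")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# Hyperparameters as listed in Model Details above
training_args = TrainingArguments(
    output_dir="sesame-csm-elise-lora",
    learning_rate=1e-5,
    num_train_epochs=4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 4
    fp16=True,
)

# Raw dataset; preprocessing into CSM model inputs is omitted here
train_dataset = load_dataset("MrDragonFox/Elise", split="train")

Passing model, training_args, and the processed dataset to a transformers Trainer (or an equivalent custom loop) completes the fine-tune.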

Quick Start

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from peft import PeftModel
import soundfile as sf
from IPython.display import Audio, display

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and processor
base_model_id = "sesame/csm-1b"
adapter_model_id = "keanteng/sesame-csm-elise-lora"  # LoRA adapter weights

# Load processor
processor = AutoProcessor.from_pretrained(base_model_id)

# Load base model
base_model = CsmForConditionalGeneration.from_pretrained(
    base_model_id, 
    device_map=device,
    torch_dtype=torch.float16  # Use half precision for faster inference
)

# Load adapter and merge weights
model = PeftModel.from_pretrained(base_model, adapter_model_id)
model = model.merge_and_unload()  # Merge adapter weights into base model
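# Note: merging folds the adapter into the base weights for faster inference;
# skip merge_and_unload() to keep the adapter separate and swappable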

# Optimize for generation
model.generation_config.max_length = 256
model.generation_config.use_cache = True
model.generation_config.cache_implementation = "static"
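# Static cache pre-allocates the KV cache to a fixed size, which recent
# transformers versions can exploit for faster decoding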

if hasattr(model, "depth_decoder"):
    model.depth_decoder.generation_config.cache_implementation = "static"

# Define a simple input
conversation = [
    {"role": "0", "content": [
        {"type": "text", "text": "Hello! I'm so happy to see you today!"}
    ]},
]

# Process input
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# Generate audio
audio = model.generate(**inputs, output_audio=True)

# Convert to numpy and save
audio_cpu = audio[0].to(torch.float32).cpu().numpy()
output_file = "output.wav"
sf.write(output_file, audio_cpu, 24000)

# Play audio if in notebook
try:
    display(Audio(output_file))
except Exception:
    print(f"Audio saved to {output_file}")