This model is a fine-tuned version of sesame/csm-1b, trained on the Elise dataset with LoRA. Sample output files are included in the repository.
The sound quality appears better than full-parameter fine-tuning, but more tweaking would be needed for consistent performance. In the samples, two distinct voices (soft and vibrant) can be heard depending on how the model is prompted. Performance on longer token sequences also needs further validation.
A larger training dataset would be required for more consistent voice output, as the current dataset is small and limited.
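The adapter is a PEFT LoRA adapter (it is loaded with PeftModel in the usage example below). For context, the sketch that follows shows how such an adapter is typically attached to the base model before training; the rank, alpha, dropout, and target modules are illustrative assumptions, not the exact settings used for this checkpoint. The inference example after it loads the published adapter on top of the base model.
import torch
from transformers import CsmForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base model in half precision
base = CsmForConditionalGeneration.from_pretrained(
    "sesame/csm-1b",
    device_map="cuda",
    torch_dtype=torch.float16,
)

# Illustrative LoRA settings; the actual values used for this checkpoint may differ
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Wrap the base model so only the low-rank adapter weights are trainable
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()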
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from peft import PeftModel
import soundfile as sf
from IPython.display import Audio, display
# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load model and processor
base_model_id = "sesame/csm-1b"
adapter_model_id = "keanteng/sesame-csm-elise-lora"  # LoRA adapter fine-tuned on the Elise dataset
# Load processor
processor = AutoProcessor.from_pretrained(base_model_id)
# Load base model
base_model = CsmForConditionalGeneration.from_pretrained(
    base_model_id,
    device_map=device,
    torch_dtype=torch.float16,  # Use half precision for faster inference
)
# Load adapter and merge weights
model = PeftModel.from_pretrained(base_model, adapter_model_id)
model = model.merge_and_unload() # Merge adapter weights into base model
# Optimize for generation
model.generation_config.max_length = 256
model.generation_config.use_cache = True
model.generation_config.cache_implementation = "static"
if hasattr(model, "depth_decoder"):
    model.depth_decoder.generation_config.cache_implementation = "static"
# Define a simple input
conversation = [
    {"role": "0", "content": [
        {"type": "text", "text": "Hello! I'm so happy to see you today!"}
    ]},
]
# Process input
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)
# Generate audio
audio = model.generate(**inputs, output_audio=True)
# Convert to numpy and save
audio_cpu = audio[0].to(torch.float32).cpu().numpy()
output_file = "output.wav"
sf.write(output_file, audio_cpu, 24000)
# Play audio if in notebook
try:
    display(Audio(output_file))
except Exception:
    print(f"Audio saved to {output_file}")
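As a quick way to hear the soft versus vibrant voice behaviour mentioned above, the same model and processor from the snippet above can be looped over a few prompts; the prompt texts here are only illustrative.
# Reuses `model`, `processor`, `device`, and `sf` from the snippet above
prompts = [
    "I had such a calm, quiet afternoon reading by the window.",
    "We actually won the whole tournament, I can't believe it!",
]
for i, text in enumerate(prompts):
    conv = [{"role": "0", "content": [{"type": "text", "text": text}]}]
    batch = processor.apply_chat_template(
        conv,
        tokenize=True,
        return_dict=True,
    ).to(device)
    out = model.generate(**batch, output_audio=True)
    # Save each generated clip to its own 24 kHz WAV file
    sf.write(f"output_{i}.wav", out[0].to(torch.float32).cpu().numpy(), 24000)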