csm-expressiva

An experimental SFT fine-tune of CSM (Conversational Speech Model) on the whispering voice of Expresso's fourth speaker. A quick spin-off to see whether SFT LoRA tuning with the csm-mlx repository works well.

Training was done on a MacBook Air M2 (16 GB) with heavy swap usage; it took 0:43:47.

Two styles of checkpoints are present in the repository: ckpt.pt and ckpt.safetensors are for original PyTorch-based CSM implementations, and mlx-ckpt.safetensors is for the csm-mlx repository.

Note: Please use speaker_id 4 when running inference - since that's what the model was trained with!

For original PyTorch-based CSM implementations, changing the repository name should work - since all filenames are identical.
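For example, with the original SesameAILabs/csm repository - this is a minimal sketch assuming its early load_csm_1b(ckpt_path, device) entry point; newer revisions load straight from the Hub, so adjust to whichever implementation you're using:

import torchaudio
from huggingface_hub import hf_hub_download
from generator import load_csm_1b  # generator.py from SesameAILabs/csm

# Point at this fine-tune instead of sesame/csm-1b; the filename is the same.
ckpt_path = hf_hub_download(repo_id="senstella/csm-expressiva-1b", filename="ckpt.pt")
generator = load_csm_1b(ckpt_path, "cuda")

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=4,  # the speaker id this fine-tune was trained with
    context=[],
    max_audio_length_ms=20_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)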

For csm-mlx, since the filename is not ckpt.safetensors but mlx-ckpt.safetensors, you should load the latter explicitly. Like this:

from mlx_lm.sample_utils import make_sampler
from huggingface_hub import hf_hub_download
from csm_mlx import CSM, csm_1b, generate

import audiofile
import numpy as np

csm = CSM(csm_1b())
weight = hf_hub_download(repo_id="senstella/csm-expressiva-1b", filename="mlx-ckpt.safetensors") # Here's the difference!
csm.load_weights(weight)

audio = generate(
    csm,
    text="Hello from Sesame.",
    speaker=4, # And this is another difference - please use 4 regardless of where you're inferencing!
    context=[],
    max_audio_length_ms=20_000,
    sampler=make_sampler(temp=0.8, top_k=50)
)

audiofile.write("./audio.wav", np.asarray(audio), 24000)
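The returned audio is a 24 kHz waveform (CSM decodes speech through the Mimi codec, which runs at 24 kHz), which is why 24000 is passed as the sample rate when writing the file.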

Some observations:

  • Small-set SFT somewhat mitigates the CSM base model's failure cases (non-ending silence, etc.)
    • It still fails sometimes, but much less frequently than before SFT tuning.
  • A small SFT run can easily copy the voice in nice detail.
  • Seems much more stable when quantized! (This was reported in this PR first!) See the sketch below for one way to quantize.
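
Quantization with csm-mlx can go through MLX's built-in nn.quantize, which converts compatible layers (Linear, Embedding) in place. A minimal sketch - the 4-bit / group-size-64 settings here are illustrative assumptions, not tested recommendations:

import mlx.nn as nn
from huggingface_hub import hf_hub_download
from csm_mlx import CSM, csm_1b

csm = CSM(csm_1b())
weight = hf_hub_download(repo_id="senstella/csm-expressiva-1b", filename="mlx-ckpt.safetensors")
csm.load_weights(weight)

# Quantize compatible layers in place; generation then proceeds as usual.
nn.quantize(csm, group_size=64, bits=4)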

Hyperparameters used:

  • batch_size: 1
  • epoch: 1
  • first_codebook_weight_multiplier: 1.1 (see the sketch after this list)
  • learning_rate: 1e-4
  • weight_decay: 1e-4
  • optimizer: adamw
  • lora_rank: 8
  • lora_alpha: 16
  • target_modules: attn, codebook0_head, projection
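
For context on first_codebook_weight_multiplier: CSM predicts one token per RVQ codebook per frame, and the zeroth codebook is the first level of the residual quantizer (cf. the codebook0_head target module), so its loss term gets scaled up slightly. Below is an illustrative sketch of that weighting, not the repository's actual training code - the tensor shapes and the averaging are assumptions:

import mlx.nn as nn

def codebook_loss(logits, targets, first_codebook_weight_multiplier=1.1):
    # Assumed shapes: logits (batch, frames, n_codebooks, vocab),
    # targets (batch, frames, n_codebooks) holding integer token ids.
    n_codebooks = logits.shape[2]
    total = 0.0
    for cb in range(n_codebooks):
        ce = nn.losses.cross_entropy(
            logits[:, :, cb, :], targets[:, :, cb], reduction="mean"
        )
        # Weight codebook 0 slightly higher than the rest.
        weight = first_codebook_weight_multiplier if cb == 0 else 1.0
        total = total + weight * ce
    return total / n_codebooks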

The future plan is to implement KTO (Kahneman-Tversky Optimization) on csm-mlx and use it to further mitigate the model's failure cases.

Note

This model was fine-tuned to investigate whether the CSM-1b model exhibits an emergent capacity to effectively compress and reconstruct whisper-style vocal features - something that traditional TTS models do not usually demonstrate. It also serves as a preliminary verification of the csm-mlx training setup and the correctness of its loss function. I want to make it clear that I do not endorse or encourage any inappropriate use of this model. Any unintended associations or interpretations do not reflect the intent behind this model.

License

The license follows the Expresso dataset's CC-BY-NC-4.0, since the model is trained on it!
