csm-expressiva

An experimental SFT fine-tune of CSM (Conversational Speech Model) on the whispering voice of Expresso's fourth speaker. A quick spin-off to see whether SFT LoRA tuning with the csm-mlx repository works well.

Training was done on a MacBook Air M2 (16 GB) with heavy swap usage; it took 0:43:47.

Two styles of checkpoints are present in the repository: ckpt.pt and ckpt.safetensors are for original PyTorch-based CSM implementations, and mlx-ckpt.safetensors is for the csm-mlx repository.

Note: Please use speaker_id 4 when running inference - since that's what the model was trained with!

For original PyTorch-based CSM implementations, changing the repository name should work - since all filenames are identical.
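For example, with the original SesameAILabs/csm repository - this is a minimal sketch assuming its early load_csm_1b(ckpt_path, device) entry point; newer revisions load straight from the Hub, so adjust to whichever implementation you're using:

import torchaudio
from huggingface_hub import hf_hub_download
from generator import load_csm_1b  # generator.py from SesameAILabs/csm

# Point at this fine-tune instead of sesame/csm-1b; the filename is the same.
ckpt_path = hf_hub_download(repo_id="senstella/csm-expressiva-1b", filename="ckpt.pt")
generator = load_csm_1b(ckpt_path, "cuda")

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=4,  # the speaker id this fine-tune was trained with
    context=[],
    max_audio_length_ms=20_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)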

For csm-mlx, since the filename is not ckpt.safetensors but mlx-ckpt.safetensors, you should load the latter explicitly. Like this:

from mlx_lm.sample_utils import make_sampler
from huggingface_hub import hf_hub_download
from csm_mlx import CSM, csm_1b, generate

import audiofile
import numpy as np

csm = CSM(csm_1b())
weight = hf_hub_download(repo_id="senstella/csm-expressiva-1b", filename="mlx-ckpt.safetensors") # Here's the difference!
csm.load_weights(weight)

audio = generate(
    csm,
    text="Hello from Sesame.",
    speaker=4, # And this is another difference - please use 4 regardless of where you're inferencing!
    context=[],
    max_audio_length_ms=20_000,
    sampler=make_sampler(temp=0.8, top_k=50)
)

audiofile.write("./audio.wav", np.asarray(audio), 24000)
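The returned audio is a 24 kHz waveform (CSM decodes speech through the Mimi codec, which runs at 24 kHz), which is why 24000 is passed as the sample rate when writing the file.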

Some observations:

  • Small-set SFT somewhat mitigates the CSM base model's failure cases (non-ending silence, etc.)
    • It still fails sometimes, but much less frequently than before SFT tuning.
  • A small SFT run can easily copy the voice in nice detail.
  • Seems much more stable when quantized! (This was reported in this PR first!) See the sketch below for one way to quantize.
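
Quantization with csm-mlx can go through MLX's built-in nn.quantize, which converts compatible layers (Linear, Embedding) in place. A minimal sketch - the 4-bit / group-size-64 settings here are illustrative assumptions, not tested recommendations:

import mlx.nn as nn
from huggingface_hub import hf_hub_download
from csm_mlx import CSM, csm_1b

csm = CSM(csm_1b())
weight = hf_hub_download(repo_id="senstella/csm-expressiva-1b", filename="mlx-ckpt.safetensors")
csm.load_weights(weight)

# Quantize compatible layers in place; generation then proceeds as usual.
nn.quantize(csm, group_size=64, bits=4)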

Hyperparameters used:

  • batch_size: 1
  • epoch: 1
  • first_codebook_weight_multiplier: 1.1 (see the sketch after this list)
  • learning_rate: 1e-4
  • weight_decay: 1e-4
  • optimizer: adamw
  • lora_rank: 8
  • lora_alpha: 16
  • target_modules: attn, codebook0_head, projection
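
For context on first_codebook_weight_multiplier: CSM predicts one token per RVQ codebook per frame, and the zeroth codebook is the first level of the residual quantizer (cf. the codebook0_head target module), so its loss term gets scaled up slightly. Below is an illustrative sketch of that weighting, not the repository's actual training code - the tensor shapes and the averaging are assumptions:

import mlx.nn as nn

def codebook_loss(logits, targets, first_codebook_weight_multiplier=1.1):
    # Assumed shapes: logits (batch, frames, n_codebooks, vocab),
    # targets (batch, frames, n_codebooks) holding integer token ids.
    n_codebooks = logits.shape[2]
    total = 0.0
    for cb in range(n_codebooks):
        ce = nn.losses.cross_entropy(
            logits[:, :, cb, :], targets[:, :, cb], reduction="mean"
        )
        # Weight codebook 0 slightly higher than the rest.
        weight = first_codebook_weight_multiplier if cb == 0 else 1.0
        total = total + weight * ce
    return total / n_codebooks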

The future plan is to implement KTO (Kahneman-Tversky Optimization) on csm-mlx and use it to further mitigate the model's failure cases.

Note

This model was fine-tuned to investigate whether the CSM-1b model exhibits an emergent capacity to effectively compress and reconstruct whisper-style vocal features - something that traditional TTS models do not usually demonstrate. It also serves as a preliminary verification of the csm-mlx training setup and the correctness of its loss function. I want to make it clear that I do not endorse or encourage any inappropriate use of this model. Any unintended associations or interpretations do not reflect the intent behind this model.

License

The license follows the Expresso dataset's CC-BY-NC-4.0, since the model is trained on it!
