csm-expressiva
An experimental SFT fine-tune of CSM (Conversational Speech Model) on Expresso's 4th whispering voice. A quick spin-off to see whether SFT LoRA tuning with the csm-mlx repository works well.
Trained on a MacBook Air M2 16GB with heavy swap usage; training took 0:43:47.
Two checkpoint formats are present in the repository: ckpt.pt and ckpt.safetensors are for the original PyTorch-based CSM implementations, while mlx-ckpt.safetensors is for the csm-mlx repository.
Note: Please use speaker_id 4 when inferencing - since that's what the model was trained with!
For the original PyTorch-based CSM implementations, changing the repository name should work - since all filenames are identical.
For csm-mlx, since the filename is not ckpt.safetensors but mlx-ckpt.safetensors, you should load the latter, like this:
```python
from mlx_lm.sample_utils import make_sampler
from huggingface_hub import hf_hub_download
from csm_mlx import CSM, csm_1b, generate
import audiofile
import numpy as np

csm = CSM(csm_1b())
weight = hf_hub_download(repo_id="senstella/csm-expressiva-1b", filename="mlx-ckpt.safetensors")  # Here's the difference!
csm.load_weights(weight)

audio = generate(
    csm,
    text="Hello from Sesame.",
    speaker=4,  # And this is another difference - please use 4 regardless of where you're inferencing!
    context=[],
    max_audio_length_ms=20_000,
    sampler=make_sampler(temp=0.8, top_k=50),
)

audiofile.write("./audio.wav", np.asarray(audio), 24000)
```
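The make_sampler call above enables temperature and top-k sampling. As a rough illustration of what those two parameters do (this is a plain-NumPy sketch, not csm-mlx's or mlx-lm's actual implementation), top-k temperature sampling can be written as:

```python
import numpy as np

def sample_top_k(logits, temp=0.8, top_k=50, rng=None):
    """Sample a token id: apply temperature, keep only the top_k highest
    logits, softmax the rest, then draw from that distribution."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temp
    # Mask everything outside the top-k to -inf so softmax gives it zero weight.
    kth = np.sort(logits)[-top_k] if top_k < logits.size else -np.inf
    masked = np.where(logits >= kth, logits, -np.inf)
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))

# With top_k=1 this reduces to greedy argmax.
print(sample_top_k([0.1, 2.0, -1.0, 0.5], top_k=1))  # → 1
```

Lower temp sharpens the distribution toward the highest-probability tokens; smaller top_k hard-limits how many candidates can be drawn at all.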
Some observations:
- Small-set SFT somewhat mitigates CSM base-model failure cases (non-ending silence, etc.).
- It sometimes still fails, but much less frequently than before SFT tuning.
- A small SFT run can easily copy the voice in nice detail.
- Seems much more stable when quantized! (This was reported in this PR first!)
Hyperparameters used:
- batch_size: 1
- epoch: 1
- first_codebook_weight_multiplier: 1.1
- learning-rate: 1e-4
- weight-decay: 1e-4
- optimizer: adamw
- lora-rank: 8
- lora-alpha: 16
- target-modules: attn, codebook0_head, projection
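For intuition, lora-rank 8 with lora-alpha 16 means each adapted weight matrix receives a rank-8 update scaled by alpha/rank = 2. A minimal NumPy sketch of a LoRA-adapted linear layer (an illustration of the general technique, not the csm-mlx training code; all shapes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in = 64, 32
rank, alpha = 8, 16           # lora-rank / lora-alpha from the run above
scale = alpha / rank          # = 2.0

W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def lora_linear(x):
    # Base path plus low-rank update; only A and B are trained during SFT.
    return W @ x + scale * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, the adapted layer matches the base layer exactly,
# so training starts from the base model's behavior.
print(np.allclose(lora_linear(x), W @ x))  # → True
```

Targeting only attn, codebook0_head, and projection keeps the trainable parameter count small, which is what makes a run like this feasible on a 16GB laptop.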
The future plan is to implement KTO on csm-mlx and use that approach to further mitigate model failure cases.
Note
This model was fine-tuned to investigate whether the CSM-1b model exhibits an emergent capacity to effectively compress and reconstruct whisper-style vocal features - something that traditional TTS models do not usually demonstrate. It also serves as a preliminary verification of the csm-mlx training setup and the correctness of its loss function. I want to make it clear that I do not endorse or encourage any inappropriate use of this model. Any unintended associations or interpretations do not reflect the intent behind this model.
License
The license follows the Expresso dataset's cc-by-nc-4.0, since the model is trained on it!
Model tree for senstella/csm-expressiva-1b
Base model: sesame/csm-1b