CSM-1B Danish Text-to-Speech (LoRA)
A natural-sounding Danish text-to-speech model based on CSM-1B, fine-tuned using LoRA (Low-Rank Adaptation) on a combination of Common Voice 17, CoRal-TTS, and private Danish speech data. Fine-tuned by Nicolaj Reck.
Model Description
This model is a LoRA adapter for sesame/csm-1b that enables natural Danish speech synthesis with optional voice control. The adapter was trained specifically for Danish TTS while preserving the multilingual capabilities of the base model.
- Base Model: sesame/csm-1b
- Language: Danish (da)
- Task: Text-to-Speech
- License: Apache 2.0
- Model Type: LoRA Adapter
- Precision: FP16/BF16
Key Features
- Natural Danish synthesis with clear pronunciation and fluent prosody
- English with a Danish accent - well suited for bilingual content
- Voice control with male/female speaker selection
- Efficient fine-tuning using LoRA (only ~16M parameters trained)
- Voice leakage prevention through frozen speaker/codec modules
- Ready-to-use Gradio interface included
Quick Start
Installation
pip install transformers torch torchaudio gradio
Basic Usage
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
# Load model and processor
model = CsmForConditionalGeneration.from_pretrained("nicolajreck/csm-1b-danish-tts").to("cuda")
processor = AutoProcessor.from_pretrained("nicolajreck/csm-1b-danish-tts")
# Generate speech
text = "[1]Hej! Velkommen til dansk tale syntese." # [1] for female voice
inputs = processor(text, add_special_tokens=True).to("cuda")
audio = model.generate(**inputs, output_audio=True)
# Save audio
processor.save_audio(audio, "output.wav")
Web Interface
Launch the included Gradio interface:
python danish_tts.py
Access it at http://localhost:7860 for an interactive TTS experience, or use the live Hugging Face Space.
Voice Control
The model supports two speaker voices:
- [0] - Male voice
- [1] - Female voice
Simply prefix your Danish text with the speaker token:
[0]God morgen! Hvordan har du det? (Male)
[1]God morgen! Hvordan har du det? (Female)
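As a quick illustration, the loop below generates the same greeting with both voices. It assumes the model and processor from the Basic Usage example are already loaded on the GPU; the output filenames are arbitrary.
# Generate the same sentence with both speaker voices
for speaker, name in [("[0]", "male"), ("[1]", "female")]:
    text = f"{speaker}God morgen! Hvordan har du det?"
    inputs = processor(text, add_special_tokens=True).to("cuda")
    audio = model.generate(**inputs, output_audio=True)
    processor.save_audio(audio, f"greeting_{name}.wav")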
Training Details
Training Data
The model was trained on a carefully curated mix of Danish speech data:
- Common Voice 17 Danish: ~10,224 validated samples
- CoRal-TTS Danish: ~16,547 filtered samples
- Private Extension: ~8,644 additional samples
Total: ~35,415 Danish speech samples with balanced representation across datasets.
Training Configuration
- Method: LoRA (Low-Rank Adaptation)
- Rank: 16, Alpha: 32, Dropout: 0.05
- Target Modules: {q_proj, k_proj, v_proj, o_proj, out_proj, gate_proj, up_proj, down_proj, fc1, fc2}
- Hardware: Single RTX 3090 (24GB)
- Precision: FP16 training, supports FP16/BF16 inference
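For reference, the values above correspond roughly to the following PEFT configuration. This is a sketch assuming the peft library; the actual training script is not included in this repository.
from peft import LoraConfig

# Illustrative LoRA configuration mirroring the values listed above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj", "out_proj",
        "gate_proj", "up_proj", "down_proj", "fc1", "fc2",
    ],
)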
Data Processing
- Duration filtering: 0.6-16 seconds
- Text normalization: quote stripping and ensuring terminal punctuation
- Equal-probability dataset mixing to prevent bias
- Chat-style formatting with Danish language cue
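A minimal sketch of this kind of preprocessing is shown below; the helper names are hypothetical and only illustrate the filtering, normalization, and equal-probability mixing described above.
import random

def keep_sample(duration_s: float) -> bool:
    # Duration filter: keep clips between 0.6 and 16 seconds
    return 0.6 <= duration_s <= 16.0

def normalize_text(text: str) -> str:
    # Strip surrounding quotes and ensure terminal punctuation
    text = text.strip().strip('"\'«»“”')
    if text and text[-1] not in ".!?":
        text += "."
    return text

def next_training_sample(datasets):
    # Equal-probability mixing: pick a dataset uniformly, then a sample from it
    return random.choice(random.choice(datasets))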
Recommended Settings
For the most natural and fluent speech, use these generation parameters:
# Natural speech settings
audio = model.generate(
**inputs,
output_audio=True,
do_sample=True,
temperature=0.96,
depth_decoder_temperature=0.7,
top_k=50,
top_p=0.9,
repetition_penalty=1.0
)
Example Outputs
The model handles various Danish text types effectively:
- "Husk at gemme arbejdet, før computeren genstarter, ellers risikerer du at miste både filer og vigtige ændringer."
- "Vi gør opmærksom på, at toget mod Københavns Hovedbanegård er forsinket med omkring 15 minutter. Vi undskylder ventetiden og takker for jeres tålmodighed."
Performance
Compared to the base CSM-1B model on Danish text:
- ✅ Improved pronunciation and word clarity
- ✅ More natural rhythm and speaking flow
- ✅ Fewer dropped sounds
- ✅ Pleasant voice quality across different text types
Gradio Interface Features
The included danish_tts.py provides a comprehensive web interface with:
- Three-column layout: Input settings, sampling controls, audio output
- Auto max-length calculation with adjustable multiplier
- Advanced parameter control: Dual temperatures, Top-K/Top-P, repetition penalty
- Pre-configured examples with optimized settings
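A stripped-down version of such an interface could look like the sketch below. This is an illustration only, not the full danish_tts.py; it assumes the model and processor from the Basic Usage example are loaded on the GPU, and the output filename is arbitrary.
import gradio as gr

def synthesize(text, temperature):
    # Default to the female voice if no speaker token is given (illustrative choice)
    if not text.startswith("[0]") and not text.startswith("[1]"):
        text = "[1]" + text
    inputs = processor(text, add_special_tokens=True).to("cuda")
    audio = model.generate(**inputs, output_audio=True, do_sample=True, temperature=temperature)
    processor.save_audio(audio, "ui_output.wav")
    return "ui_output.wav"

demo = gr.Interface(
    fn=synthesize,
    inputs=[gr.Textbox(label="Danish text"), gr.Slider(0.5, 1.2, value=0.96, label="Temperature")],
    outputs=gr.Audio(label="Generated speech"),
)
demo.launch()  # serves at http://localhost:7860 by default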
Limitations
- Optimized specifically for Danish - other languages may have reduced quality
- Requires the base model sesame/csm-1b to function (see the loading sketch below)
- Voice control limited to a binary male/female selection
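If you prefer to attach the adapter to the base model explicitly rather than loading the repository id directly as in the Quick Start, one possible route is via PEFT. This is a sketch that assumes the peft library is installed and that the repository exposes a standard PEFT adapter layout.
from peft import PeftModel
from transformers import AutoProcessor, CsmForConditionalGeneration

# Load the frozen base model, then apply the Danish LoRA adapter on top
base = CsmForConditionalGeneration.from_pretrained("sesame/csm-1b")
model = PeftModel.from_pretrained(base, "nicolajreck/csm-1b-danish-tts")
processor = AutoProcessor.from_pretrained("nicolajreck/csm-1b-danish-tts")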
Model Architecture
- Base: CSM-1B encoder-decoder with depth decoder
- Audio Format: 24kHz, generated via audio tokens
- LoRA Integration: Language projections only, speaker/codec frozen
- Memory Requirements: ~8GB VRAM for inference
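To stay within the stated ~8GB VRAM budget, the model can be loaded in half precision. The snippet below is one way to do this, using the same repository id as in the Quick Start.
import torch
from transformers import CsmForConditionalGeneration

# Load in FP16 to reduce memory use; BF16 is also supported per the precision notes above
model = CsmForConditionalGeneration.from_pretrained(
    "nicolajreck/csm-1b-danish-tts",
    torch_dtype=torch.float16,
).to("cuda")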
Citation
If you use this model, please cite:
@misc{csm1b-danish-2025,
title={High-Quality Danish Text-to-Speech with CSM-1B: Data Mixing, Voice Control, and LoRA Fine-Tuning},
author={Nicolaj Reck},
year={2024},
howpublished={\url{https://huggingface.co/nicolajreck/csm-1b-danish-tts}},
note={LinkedIn: https://www.linkedin.com/in/nicolaj-reck-053aa38a/}
}
Acknowledgments
Fine-tuned by: Nicolaj Reck
Thanks to:
- Mozilla Foundation for the Common Voice 17 dataset
- CoRal-TTS project for the Danish speech corpus
- Sesame Research for the base CSM-1B model
- The open-source community for tools and frameworks
License
This model is released under the Apache 2.0 license. Please see the base model license for additional terms.