CSM-1B Danish Text-to-Speech (LoRA)
A natural-sounding Danish text-to-speech model based on CSM-1B, fine-tuned using LoRA (Low-Rank Adaptation) on a combination of Common Voice 17, CoRal-TTS, and private Danish speech data. Fine-tuned by Nicolaj Reck.
Model Description
This model is a LoRA adapter for sesame/csm-1b that enables natural Danish speech synthesis with optional voice control. The adapter was trained specifically for Danish TTS while preserving the multilingual capabilities of the base model.
- Base Model: sesame/csm-1b
- Language: Danish (da)
- Task: Text-to-Speech
- License: Apache 2.0
- Model Type: LoRA Adapter
- Precision: FP16/BF16
Key Features
- Natural Danish synthesis with clear pronunciation and fluent prosody
- English with a Danish accent - well suited for bilingual content
- Voice control with male/female speaker selection
- Efficient fine-tuning using LoRA (only ~16M parameters trained)
- Voice leakage prevention through frozen speaker/codec modules
- Ready-to-use Gradio interface included
Quick Start
Installation
pip install transformers torch torchaudio gradio
Basic Usage
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
# Load model and processor
model = CsmForConditionalGeneration.from_pretrained("nicolajreck/csm-1b-danish-tts").to("cuda")
processor = AutoProcessor.from_pretrained("nicolajreck/csm-1b-danish-tts")
# Generate speech
text = "[1]Hej! Velkommen til dansk tale syntese." # [1] for female voice
inputs = processor(text, add_special_tokens=True).to("cuda")
audio = model.generate(**inputs, output_audio=True)
# Save audio
processor.save_audio(audio, "output.wav")
Web Interface
Launch the included Gradio interface:
python danish_tts.py
Access it at http://localhost:7860 for an interactive TTS experience, or use the live Hugging Face Space.
Voice Control
The model supports two speaker voices:
- [0] - Male voice
- [1] - Female voice
Simply prefix your Danish text with the speaker token:
[0]God morgen! Hvordan har du det? (Male)
[1]God morgen! Hvordan har du det? (Female)
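As a quick illustration, the loop below generates the same greeting with both voices. It assumes the model and processor from the Basic Usage example are already loaded on the GPU; the output filenames are arbitrary.
# Generate the same sentence with both speaker voices
for speaker, name in [("[0]", "male"), ("[1]", "female")]:
    text = f"{speaker}God morgen! Hvordan har du det?"
    inputs = processor(text, add_special_tokens=True).to("cuda")
    audio = model.generate(**inputs, output_audio=True)
    processor.save_audio(audio, f"greeting_{name}.wav")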
Training Details
Training Data
The model was trained on a carefully curated mix of Danish speech data:
- Common Voice 17 Danish: ~10,224 validated samples
- CoRal-TTS Danish: ~16,547 filtered samples
- Private Extension: ~8,644 additional samples
Total: ~35,415 Danish speech samples with balanced representation across datasets.
Training Configuration
- Method: LoRA (Low-Rank Adaptation)
- Rank: 16, Alpha: 32, Dropout: 0.05
- Target Modules: {q_proj, k_proj, v_proj, o_proj, out_proj, gate_proj, up_proj, down_proj, fc1, fc2}
- Hardware: Single RTX 3090 (24GB)
- Precision: FP16 training, supports FP16/BF16 inference
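For reference, the values above correspond roughly to the following PEFT configuration. This is a sketch assuming the peft library; the actual training script is not included in this repository.
from peft import LoraConfig

# Illustrative LoRA configuration mirroring the values listed above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj", "out_proj",
        "gate_proj", "up_proj", "down_proj", "fc1", "fc2",
    ],
)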
Data Processing
- Duration filtering: 0.6-16 seconds
- Text normalization: quote stripping and ensuring terminal punctuation
- Equal-probability dataset mixing to prevent bias
- Chat-style formatting with Danish language cue
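A minimal sketch of this kind of preprocessing is shown below; the helper names are hypothetical and only illustrate the filtering, normalization, and equal-probability mixing described above.
import random

def keep_sample(duration_s: float) -> bool:
    # Duration filter: keep clips between 0.6 and 16 seconds
    return 0.6 <= duration_s <= 16.0

def normalize_text(text: str) -> str:
    # Strip surrounding quotes and ensure terminal punctuation
    text = text.strip().strip('"\'«»“”')
    if text and text[-1] not in ".!?":
        text += "."
    return text

def next_training_sample(datasets):
    # Equal-probability mixing: pick a dataset uniformly, then a sample from it
    return random.choice(random.choice(datasets))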
Recommended Settings
For the most natural and fluent speech, use these generation parameters:
# Natural speech settings
audio = model.generate(
**inputs,
output_audio=True,
do_sample=True,
temperature=0.96,
depth_decoder_temperature=0.7,
top_k=50,
top_p=0.9,
repetition_penalty=1.0
)
Example Outputs
The model handles various Danish text types effectively:
- "Husk at gemme arbejdet, før computeren genstarter, ellers risikerer du at miste både filer og vigtige ændringer."
- "Vi gør opmærksom på, at toget mod Københavns Hovedbanegård er forsinket med omkring 15 minutter. Vi undskylder ventetiden og takker for jeres tålmodighed."
Performance
Compared to the base CSM-1B model on Danish text:
- ✅ Improved pronunciation and word clarity
- ✅ More natural rhythm and speaking flow
- ✅ Fewer dropped sounds
- ✅ Pleasant voice quality across different text types
Gradio Interface Features
The included danish_tts.py provides a comprehensive web interface with:
- Three-column layout: Input settings, sampling controls, audio output
- Auto max-length calculation with adjustable multiplier
- Advanced parameter control: Dual temperatures, Top-K/Top-P, repetition penalty
- Pre-configured examples with optimized settings
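A stripped-down version of such an interface could look like the sketch below. This is an illustration only, not the full danish_tts.py; it assumes the model and processor from the Basic Usage example are loaded on the GPU, and the output filename is arbitrary.
import gradio as gr

def synthesize(text, temperature):
    # Default to the female voice if no speaker token is given (illustrative choice)
    if not text.startswith("[0]") and not text.startswith("[1]"):
        text = "[1]" + text
    inputs = processor(text, add_special_tokens=True).to("cuda")
    audio = model.generate(**inputs, output_audio=True, do_sample=True, temperature=temperature)
    processor.save_audio(audio, "ui_output.wav")
    return "ui_output.wav"

demo = gr.Interface(
    fn=synthesize,
    inputs=[gr.Textbox(label="Danish text"), gr.Slider(0.5, 1.2, value=0.96, label="Temperature")],
    outputs=gr.Audio(label="Generated speech"),
)
demo.launch()  # serves at http://localhost:7860 by default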
Limitations
- Optimized specifically for Danish - other languages may have reduced quality
- Requires the base model sesame/csm-1b to function (see the loading sketch below)
- Voice control limited to a binary male/female selection
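If you prefer to attach the adapter to the base model explicitly rather than loading the repository id directly as in the Quick Start, one possible route is via PEFT. This is a sketch that assumes the peft library is installed and that the repository exposes a standard PEFT adapter layout.
from peft import PeftModel
from transformers import AutoProcessor, CsmForConditionalGeneration

# Load the frozen base model, then apply the Danish LoRA adapter on top
base = CsmForConditionalGeneration.from_pretrained("sesame/csm-1b")
model = PeftModel.from_pretrained(base, "nicolajreck/csm-1b-danish-tts")
processor = AutoProcessor.from_pretrained("nicolajreck/csm-1b-danish-tts")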
Model Architecture
- Base: CSM-1B encoder-decoder with depth decoder
- Audio Format: 24kHz, generated via audio tokens
- LoRA Integration: Language projections only, speaker/codec frozen
- Memory Requirements: ~8GB VRAM for inference
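To stay within the stated ~8GB VRAM budget, the model can be loaded in half precision. The snippet below is one way to do this, using the same repository id as in the Quick Start.
import torch
from transformers import CsmForConditionalGeneration

# Load in FP16 to reduce memory use; BF16 is also supported per the precision notes above
model = CsmForConditionalGeneration.from_pretrained(
    "nicolajreck/csm-1b-danish-tts",
    torch_dtype=torch.float16,
).to("cuda")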
Citation
If you use this model, please cite:
@misc{csm1b-danish-2025,
title={High-Quality Danish Text-to-Speech with CSM-1B: Data Mixing, Voice Control, and LoRA Fine-Tuning},
author={Nicolaj Reck},
year={2024},
howpublished={\url{https://huggingface.co/nicolajreck/csm-1b-danish-tts}},
note={LinkedIn: https://www.linkedin.com/in/nicolaj-reck-053aa38a/}
}
Acknowledgments
Fine-tuned by: Nicolaj Reck
Thanks to:
- Mozilla Foundation for the Common Voice 17 dataset
- CoRal-TTS project for the Danish speech corpus
- Sesame Research for the base CSM-1B model
- The open-source community for tools and frameworks
License
This model is released under the Apache 2.0 license. Please see the base model license for additional terms.