TWB Voice Kanuri TTS 1.0
Model Description
This is a single-speaker Text-to-Speech (TTS) model for the Kanuri language, fine-tuned from the CML-TTS multilingual checkpoint using the YourTTS architecture. The model features a single female speaker (spk1) and generates Kanuri speech from text input.
Model Details
- Model Architecture: YourTTS (VITS-based)
- Base Model: CML-TTS Dataset multilingual checkpoint
- Language: Kanuri (kr, kau)
- Sample Rate: 24 kHz
- Speakers: 1 (spk1: female)
- Model Type: Single-speaker neural TTS
- Framework: Coqui TTS
Model Architecture Details
- Text Encoder: 10-layer transformer with 2 attention heads
- Hidden Channels: 192
- FFN Hidden Channels: 768
- Decoder: HiFi-GAN with ResBlock type 2
- Flow Layers: 4 coupling layers
- Posterior Encoder: 16-layer WaveNet
- Speaker Embedding: 512-dimensional d-vectors
- Language Embedding: 4-dimensional language embeddings
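For reference, these hyperparameters map onto Coqui TTS VitsArgs fields roughly as sketched below. This is an illustration based on the list above, not a dump of the released config.json, which remains the authoritative source:

from TTS.tts.models.vits import VitsArgs

# Sketch: architecture hyperparameters from this card expressed as VitsArgs
model_args = VitsArgs(
    num_layers_text_encoder=10,       # 10-layer transformer text encoder
    num_heads_text_encoder=2,         # 2 attention heads
    hidden_channels=192,
    hidden_channels_ffn_text_encoder=768,
    resblock_type_decoder="2",        # HiFi-GAN ResBlock type 2
    num_layers_flow=4,                # 4 flow coupling layers
    num_layers_posterior_encoder=16,  # 16-layer WaveNet posterior encoder
    use_d_vector_file=True,
    d_vector_dim=512,                 # 512-dimensional speaker d-vectors
    use_language_embedding=True,
    embedded_language_dim=4,          # 4-dimensional language embeddings
)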
Training Data
The model was trained on approximately 10 hours of Kanuri speech data from a high-quality source:
Female Speaker (spk1)
- Source: TWB Voice project
- Duration: ~10 hours
- Sample Dataset: TWB-voice-TTS-Kanuri-1.0-sampleset
- Description: High-quality female voice recordings collected within the TWB Voice 1.0 project.
Data Preprocessing
- Original Sample Rate: 48 kHz → Target: 24 kHz (downsampled)
- Audio Format: Mono WAV files
- Text Processing: Lowercase conversion, diacritics preserved
- Quality Filters:
- Duration: 0.5s - 20s
- Text length: minimum 10 characters
- Train/Dev Split: 95% train, 5% validation
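A minimal sketch of this preprocessing pipeline is shown below. The librosa-based resampling and the split logic are assumptions for illustration; the actual preparation scripts are not part of this repository:

import librosa
import soundfile as sf

TARGET_SR = 24000  # downsample 48 kHz source recordings to 24 kHz

def passes_filters(duration_s: float, text: str) -> bool:
    # Quality filters from this card: 0.5-20 s of audio, at least 10 characters
    return 0.5 <= duration_s <= 20.0 and len(text) >= 10

def preprocess(in_path: str, out_path: str) -> float:
    # Load as mono and resample in one step, then write a mono WAV
    audio, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)
    sf.write(out_path, audio, TARGET_SR)
    return len(audio) / TARGET_SR  # duration in seconds, for filtering

# 95/5 train/validation split over the filtered utterance list
# (a simple ordered split; the exact split method is not documented here)
# split = int(0.95 * len(items)); train, dev = items[:split], items[split:]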
Character Set
The model supports standard Kanuri orthography including diacritics:
Characters: abcdefghijklmnopqrstuvwxyzáúǝəә
Punctuation: ',-?. and space
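A hedged sketch of input normalization consistent with this character set (the characters come from this card; silently dropping out-of-set characters is an assumption, and the authoritative tokenizer definition lives in config.json):

# Characters accepted by the model, per this card (diacritics preserved)
KANURI_CHARS = set("abcdefghijklmnopqrstuvwxyzáúǝəә")
PUNCTUATION = set("',-?. ")  # punctuation plus space

def normalize(text: str) -> str:
    text = text.lower()  # the model expects lowercase input
    return "".join(c for c in text if c in KANURI_CHARS or c in PUNCTUATION)

print(normalize("Loktu nǝngriyi ye lan, nǝyama kulo ye dǝ so shawwa ro wurazen."))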
Training
The model was fine-tuned until no further improvement was recorded, using the following configuration:
- Batch Size: 12
- Learning Rate: 0.0001 (generator and discriminator)
- Mixed Precision: FP16
- Optimizer: AdamW
- Loss Components: Mel loss (α=45.0), Speaker encoder loss (α=9.0)
- GPU Setup: 2x NVIDIA GeForce RTX 2080
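Expressed as a Coqui TTS VitsConfig, these settings correspond roughly to the sketch below (only the fields listed on this card are shown; everything else keeps library defaults):

from TTS.tts.configs.vits_config import VitsConfig

config = VitsConfig(
    batch_size=12,
    lr_gen=0.0001,                   # generator learning rate
    lr_disc=0.0001,                  # discriminator learning rate
    mixed_precision=True,            # FP16 training
    optimizer="AdamW",
    mel_loss_alpha=45.0,             # mel-spectrogram loss weight
    speaker_encoder_loss_alpha=9.0,  # speaker consistency loss weight
)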
Evaluation Results
The model was evaluated on a set of 30 Kanuri sentences by a human evaluator using two criteria:
Sample Evaluation Sentences:
- "loktu nǝngriyi ye lan, nǝyama kulo ye dǝ so shawwa ro wurazen."
- "nǝlewa nǝm dǝ, kunguna nǝm wa faidan kozǝna."
- "na done hawar kattu ye so kǝla kurun nǝlewa ye tarzeyen so dǝa wane."
Evaluation Metrics:
- Pronunciation Accuracy (1-5 scale): 2.8 average
- Speech Naturalness (1-5 scale): 2.76 average
Usage
Installation
pip install coqui-tts torch scipy numpy huggingface_hub
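The coqui-tts package on PyPI is the Idiap-maintained fork of Coqui TTS (see Acknowledgments); the examples below assume that fork.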
Quick Start
from TTS.api import TTS
from huggingface_hub import hf_hub_download
import json
import tempfile
import scipy.io.wavfile as wavfile
import numpy as np
import os
# Download and setup model files
config_path = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "config.json")
with open(config_path, 'r') as f:
    config = json.load(f)
# Download required files
model_path = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "best_model_264313.pth")
speakers_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "speakers.pth")
language_ids_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "language_ids.json")
d_vector_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "d_vector.pth")
config_se_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "config_se.json")
model_se_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "model_se.pth")
# Update config paths
config["speakers_file"] = speakers_file
config["language_ids_file"] = language_ids_file
config["d_vector_file"] = [d_vector_file]
config["model_args"]["speakers_file"] = speakers_file
config["model_args"]["language_ids_file"] = language_ids_file
config["model_args"]["d_vector_file"] = [d_vector_file]
config["model_args"]["speaker_encoder_config_path"] = config_se_file
config["model_args"]["speaker_encoder_model_path"] = model_se_file
# Save updated config
temp_config = tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False)
json.dump(config, temp_config, indent=2)
temp_config.close()
# Load TTS model, then remove the temporary config file
tts = TTS(model_path=model_path, config_path=temp_config.name)
os.remove(temp_config.name)
# Generate speech
text = "Loktu nǝngriyi ye lan, nǝyama kulo ye dǝ so shawwa ro wurazen."
speaker = "spk1" # Options: spk1 (female)
wav = tts.synthesizer.tts(text=text.lower(), speaker_name=speaker)
wav_array = np.array(wav, dtype=np.float32)
wavfile.write("output.wav", tts.synthesizer.output_sample_rate, wav_array)
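tts.synthesizer.output_sample_rate is read from the loaded config, so output.wav is written at the model's 24 kHz rate.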
Batch Inference
For batch processing multiple sentences, use the provided inference script:
python3 inference.py /path/to/model.pth
# or
python3 inference.py /path/to/model/directory
The script will generate audio for all evaluation sentences with the available speaker.
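If you prefer not to use the script, a minimal batch loop can reuse the Quick Start setup. The sentence list and output naming below are illustrative:

# Assumes `tts` is loaded as in Quick Start above
import numpy as np
import scipy.io.wavfile as wavfile

sentences = [
    "loktu nǝngriyi ye lan, nǝyama kulo ye dǝ so shawwa ro wurazen.",
    "nǝlewa nǝm dǝ, kunguna nǝm wa faidan kozǝna.",
]
for i, sentence in enumerate(sentences):
    wav = tts.synthesizer.tts(text=sentence.lower(), speaker_name="spk1")
    wavfile.write(f"output_{i:03d}.wav", tts.synthesizer.output_sample_rate,
                  np.array(wav, dtype=np.float32))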
Model Limitations
- Language: Only supports Kanuri language
- Input Format: Requires lowercase text input
- Speakers: Limited to 1 pre-trained speaker identity
- Domain: Trained primarily on educational content and general speech
- Code-switching: Not optimized for mixed language input
Technical Specifications
- Input: Raw Kanuri text (UTF-8, lowercase)
- Output: 24 kHz mono WAV audio
- Inference Speed: ~0.1-0.5s per sentence (GPU)
- Memory Requirements: ~2GB GPU memory for inference
Ethical Considerations
- Consent: All training data used with appropriate permissions
- Bias: Model reflects the speech patterns and characteristics of the specific speaker in training data
- Use Cases: Intended for educational, accessibility, and content creation purposes
- Non-Commercial: This model is released for non-commercial use only
Licensing
This model is released under a CC-BY-NC license. For commercial licensing or other uses, please contact [email protected].
Citation
If you use this model in your research or applications, please cite:
@misc{yourtts-kanuri-2025,
  title={YourTTS Kanuri Single-Speaker Text-to-Speech Model},
  author={Alp Öktem},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0}
}
Acknowledgments
This model was created by CLEAR Global with support from the Patrick J. McGovern Foundation. We acknowledge the following open source projects and resources that made this model possible:
- Idiap Coqui TTS: For the YourTTS architecture and training framework
- CML-TTS Dataset: For the multilingual base model
- TWB Voice Project: For high-quality Kanuri voice data
Model Card Authors
Alp Öktem ([email protected])