TWB Voice Kanuri TTS 1.0
Model Description
This is a single-speaker Text-to-Speech (TTS) model for the Kanuri language, fine-tuned from the CML-TTS multilingual checkpoint using the YourTTS architecture. The model features a single female speaker (spk1) and generates Kanuri speech from text input.
Model Details
- Model Architecture: YourTTS (VITS-based)
- Base Model: CML-TTS Dataset multilingual checkpoint
- Language: Kanuri (kr, kau)
- Sample Rate: 24 kHz
- Speakers: 1 (spk1: female)
- Model Type: Single-speaker neural TTS
- Framework: Coqui TTS
Model Architecture Details
- Text Encoder: 10-layer transformer with 2 attention heads
- Hidden Channels: 192
- FFN Hidden Channels: 768
- Decoder: HiFi-GAN with ResBlock type 2
- Flow Layers: 4 coupling layers
- Posterior Encoder: 16-layer WaveNet
- Speaker Embedding: 512-dimensional d-vectors
- Language Embedding: 4-dimensional language embeddings
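For reference, these hyperparameters map onto Coqui TTS VitsArgs fields roughly as sketched below. This is an illustration based on the list above, not a dump of the released config.json, which remains the authoritative source:

from TTS.tts.models.vits import VitsArgs

# Sketch: architecture hyperparameters from this card expressed as VitsArgs
model_args = VitsArgs(
    num_layers_text_encoder=10,       # 10-layer transformer text encoder
    num_heads_text_encoder=2,         # 2 attention heads
    hidden_channels=192,
    hidden_channels_ffn_text_encoder=768,
    resblock_type_decoder="2",        # HiFi-GAN ResBlock type 2
    num_layers_flow=4,                # 4 flow coupling layers
    num_layers_posterior_encoder=16,  # 16-layer WaveNet posterior encoder
    use_d_vector_file=True,
    d_vector_dim=512,                 # 512-dimensional speaker d-vectors
    use_language_embedding=True,
    embedded_language_dim=4,          # 4-dimensional language embeddings
)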
Training Data
The model was trained on approximately 10 hours of Kanuri speech data from a high-quality source:
Female Speaker (spk1)
- Source: TWB Voice project
- Duration: ~10 hours
- Sample Dataset: TWB-voice-TTS-Kanuri-1.0-sampleset
- Description: High-quality female voice recordings collected within the TWB Voice 1.0 project.
Data Preprocessing
- Original Sample Rate: 48 kHz → Target: 24 kHz (downsampled)
- Audio Format: Mono WAV files
- Text Processing: Lowercase conversion, diacritics preserved
- Quality Filters:
- Duration: 0.5s - 20s
- Text length: minimum 10 characters
- Train/Dev Split: 95% train, 5% validation
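A minimal sketch of this preprocessing pipeline is shown below. The librosa-based resampling and the split logic are assumptions for illustration; the actual preparation scripts are not part of this repository:

import librosa
import soundfile as sf

TARGET_SR = 24000  # downsample 48 kHz source recordings to 24 kHz

def passes_filters(duration_s: float, text: str) -> bool:
    # Quality filters from this card: 0.5-20 s of audio, at least 10 characters
    return 0.5 <= duration_s <= 20.0 and len(text) >= 10

def preprocess(in_path: str, out_path: str) -> float:
    # Load as mono and resample in one step, then write a mono WAV
    audio, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)
    sf.write(out_path, audio, TARGET_SR)
    return len(audio) / TARGET_SR  # duration in seconds, for filtering

# 95/5 train/validation split over the filtered utterance list
# (a simple ordered split; the exact split method is not documented here)
# split = int(0.95 * len(items)); train, dev = items[:split], items[split:]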
Character Set
The model supports standard Kanuri orthography including diacritics:
Characters: abcdefghijklmnopqrstuvwxyzáúǝəә
Punctuation: ',-?. and space
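A hedged sketch of input normalization consistent with this character set (the characters come from this card; silently dropping out-of-set characters is an assumption, and the authoritative tokenizer definition lives in config.json):

# Characters accepted by the model, per this card (diacritics preserved)
KANURI_CHARS = set("abcdefghijklmnopqrstuvwxyzáúǝəә")
PUNCTUATION = set("',-?. ")  # punctuation plus space

def normalize(text: str) -> str:
    text = text.lower()  # the model expects lowercase input
    return "".join(c for c in text if c in KANURI_CHARS or c in PUNCTUATION)

print(normalize("Loktu nǝngriyi ye lan, nǝyama kulo ye dǝ so shawwa ro wurazen."))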
Training
The model was fine-tuned until no further improvement was recorded, using the following configuration:
- Batch Size: 12
- Learning Rate: 0.0001 (generator and discriminator)
- Mixed Precision: FP16
- Optimizer: AdamW
- Loss Components: Mel loss (α=45.0), Speaker encoder loss (α=9.0)
- GPU Setup: 2x NVIDIA GeForce RTX 2080
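Expressed as a Coqui TTS VitsConfig, these settings correspond roughly to the sketch below (only the fields listed on this card are shown; everything else keeps library defaults):

from TTS.tts.configs.vits_config import VitsConfig

config = VitsConfig(
    batch_size=12,
    lr_gen=0.0001,                   # generator learning rate
    lr_disc=0.0001,                  # discriminator learning rate
    mixed_precision=True,            # FP16 training
    optimizer="AdamW",
    mel_loss_alpha=45.0,             # mel-spectrogram loss weight
    speaker_encoder_loss_alpha=9.0,  # speaker consistency loss weight
)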
Evaluation Results
The model was evaluated on a set of 30 Kanuri sentences by a human evaluator using two criteria:
Sample Evaluation Sentences:
- "loktu nǝngriyi ye lan, nǝyama kulo ye dǝ so shawwa ro wurazen."
- "nǝlewa nǝm dǝ, kunguna nǝm wa faidan kozǝna."
- "na done hawar kattu ye so kǝla kurun nǝlewa ye tarzeyen so dǝa wane."
Evaluation Metrics:
- Pronunciation Accuracy (1-5 scale): 2.8 average
- Speech Naturalness (1-5 scale): 2.76 average
Usage
Installation
pip install coqui-tts torch scipy numpy huggingface_hub
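The coqui-tts package on PyPI is the Idiap-maintained fork of Coqui TTS (see Acknowledgments); the examples below assume that fork.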
Quick Start
from TTS.api import TTS
from huggingface_hub import hf_hub_download
import json
import tempfile
import scipy.io.wavfile as wavfile
import numpy as np
import os
# Download and setup model files
config_path = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "config.json")
with open(config_path, 'r') as f:
    config = json.load(f)
# Download required files
model_path = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "best_model_264313.pth")
speakers_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "speakers.pth")
language_ids_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "language_ids.json")
d_vector_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "d_vector.pth")
config_se_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "config_se.json")
model_se_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "model_se.pth")
# Update config paths
config["speakers_file"] = speakers_file
config["language_ids_file"] = language_ids_file
config["d_vector_file"] = [d_vector_file]
config["model_args"]["speakers_file"] = speakers_file
config["model_args"]["language_ids_file"] = language_ids_file
config["model_args"]["d_vector_file"] = [d_vector_file]
config["model_args"]["speaker_encoder_config_path"] = config_se_file
config["model_args"]["speaker_encoder_model_path"] = model_se_file
# Save updated config
temp_config = tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False)
json.dump(config, temp_config, indent=2)
temp_config.close()
# Load TTS model, then remove the temporary config file
tts = TTS(model_path=model_path, config_path=temp_config.name)
os.remove(temp_config.name)
# Generate speech
text = "Loktu nǝngriyi ye lan, nǝyama kulo ye dǝ so shawwa ro wurazen."
speaker = "spk1" # Options: spk1 (female)
wav = tts.synthesizer.tts(text=text.lower(), speaker_name=speaker)
wav_array = np.array(wav, dtype=np.float32)
wavfile.write("output.wav", tts.synthesizer.output_sample_rate, wav_array)
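tts.synthesizer.output_sample_rate is read from the loaded config, so output.wav is written at the model's 24 kHz rate.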
Batch Inference
For batch processing multiple sentences, use the provided inference script:
python3 inference.py /path/to/model.pth
# or
python3 inference.py /path/to/model/directory
The script will generate audio for all evaluation sentences with the available speaker.
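If you prefer not to use the script, a minimal batch loop can reuse the Quick Start setup. The sentence list and output naming below are illustrative:

# Assumes `tts` is loaded as in Quick Start above
import numpy as np
import scipy.io.wavfile as wavfile

sentences = [
    "loktu nǝngriyi ye lan, nǝyama kulo ye dǝ so shawwa ro wurazen.",
    "nǝlewa nǝm dǝ, kunguna nǝm wa faidan kozǝna.",
]
for i, sentence in enumerate(sentences):
    wav = tts.synthesizer.tts(text=sentence.lower(), speaker_name="spk1")
    wavfile.write(f"output_{i:03d}.wav", tts.synthesizer.output_sample_rate,
                  np.array(wav, dtype=np.float32))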
Model Limitations
- Language: Only supports Kanuri language
- Input Format: Requires lowercase text input
- Speakers: Limited to 1 pre-trained speaker identity
- Domain: Trained primarily on educational content and general speech
- Code-switching: Not optimized for mixed language input
Technical Specifications
- Input: Raw Kanuri text (UTF-8, lowercase)
- Output: 24 kHz mono WAV audio
- Inference Speed: ~0.1-0.5s per sentence (GPU)
- Memory Requirements: ~2GB GPU memory for inference
Ethical Considerations
- Consent: All training data used with appropriate permissions
- Bias: Model reflects the speech patterns and characteristics of the specific speaker in training data
- Use Cases: Intended for educational, accessibility, and content creation purposes
- Non-Commercial: This model is released for non-commercial use only
Licensing
This model is released under a CC-BY-NC license. For commercial licensing or other uses, please contact [email protected].
Citation
If you use this model in your research or applications, please cite:
@misc{yourtts-kanuri-2025,
  title={YourTTS Kanuri Single-Speaker Text-to-Speech Model},
  author={Alp Öktem},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0}
}
Acknowledgments
This model was created by CLEAR Global with support from the Patrick J. McGovern Foundation. We acknowledge the following open source projects and resources that made this model possible:
- Idiap Coqui TTS: For the YourTTS architecture and training framework
- CML-TTS Dataset: For the multilingual base model
- TWB Voice Project: For high-quality Kanuri voice data
Model Card Authors
Alp Öktem ([email protected])