TWB Voice Kanuri TTS 1.0

Model Description

This is a single-speaker Text-to-Speech (TTS) model for the Kanuri language, fine-tuned from the CML-TTS multilingual checkpoint using the YourTTS architecture. The model features one female speaker (spk1) and generates Kanuri speech from text input.

Model Details

  • Model Architecture: YourTTS (VITS-based)
  • Base Model: CML-TTS Dataset multilingual checkpoint
  • Language: Kanuri (kr, kau)
  • Sample Rate: 24 kHz
  • Speakers: 1 (spk1: female)
  • Model Type: Single-speaker neural TTS
  • Framework: Coqui TTS

Model Architecture Details

  • Text Encoder: 10-layer transformer with 2 attention heads
  • Hidden Channels: 192
  • FFN Hidden Channels: 768
  • Decoder: HiFi-GAN with ResBlock type 2
  • Flow Layers: 4 coupling layers
  • Posterior Encoder: 16-layer WaveNet
  • Speaker Embedding: 512-dimensional d-vectors
  • Language Embedding: 4-dimensional language embeddings
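
These hyperparameters live in the model_args block of the shipped config.json. Below is a minimal sketch for inspecting them after download; the key names follow Coqui TTS's VITS model arguments and should be verified against the actual file:

import json
from huggingface_hub import hf_hub_download

# Download config.json and print the architecture hyperparameters listed above.
config_path = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "config.json")
with open(config_path) as f:
    model_args = json.load(f)["model_args"]

# Key names follow Coqui TTS's VITS model_args and may differ between versions.
for key in (
    "num_layers_text_encoder", "num_heads_text_encoder",
    "hidden_channels", "hidden_channels_ffn_text_encoder",
    "num_layers_flow", "num_layers_posterior_encoder",
    "d_vector_dim", "embedded_language_dim",
):
    print(f"{key} = {model_args.get(key)}")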

Training Data

The model was trained on approximately 10 hours of Kanuri speech data from a single high-quality source, recorded by one female speaker (spk1).

Data Preprocessing

  • Original Sample Rate: 48 kHz → Target: 24 kHz (downsampled)
  • Audio Format: Mono WAV files
  • Text Processing: Lowercase conversion, diacritics preserved
  • Quality Filters:
    • Duration: 0.5s - 20s
    • Text length: minimum 10 characters
  • Train/Dev Split: 95% train, 5% validation
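
The exact preprocessing pipeline is not published with the model; the sketch below simply mirrors the filters listed above, assuming librosa and soundfile are available:

import librosa
import soundfile as sf

TARGET_SR = 24000  # original recordings are 48 kHz; the model expects 24 kHz

def preprocess_clip(wav_path: str, text: str, out_path: str) -> bool:
    """Apply the filters listed above; return False if the clip is rejected."""
    audio, _ = librosa.load(wav_path, sr=TARGET_SR, mono=True)  # downsample to 24 kHz mono
    duration = len(audio) / TARGET_SR
    if not (0.5 <= duration <= 20.0):  # duration filter: 0.5s - 20s
        return False
    if len(text.lower()) < 10:         # text length filter: minimum 10 characters
        return False
    sf.write(out_path, audio, TARGET_SR)
    return True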

Character Set

The model supports standard Kanuri orthography including diacritics:

Characters: abcdefghijklmnopqrstuvwxyzáúǝəә

Punctuation: ',-?. and space
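
Text containing characters outside this set may synthesize unreliably. A small pre-flight check against the set listed above (the helper name is illustrative):

# Check input text against the supported character set listed above.
ALLOWED = set("abcdefghijklmnopqrstuvwxyzáúǝəә',-?. ")

def unsupported_chars(text: str) -> list[str]:
    """Return characters (after lowercasing) outside the supported set."""
    return sorted({ch for ch in text.lower() if ch not in ALLOWED})

unknown = unsupported_chars("Loktu nǝngriyi ye lan?")
if unknown:
    print("Unsupported characters:", unknown)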

Training

The model was fine-tuned until no further improvement was recorded, using the following configuration:

  • Batch Size: 12
  • Learning Rate: 0.0001 (generator and discriminator)
  • Mixed Precision: FP16
  • Optimizer: AdamW
  • Loss Components: Mel loss (α=45.0), Speaker encoder loss (α=9.0)
  • GPU Setup: 2x NVIDIA GeForce RTX 2080
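
For reference, here is a hedged sketch of how this setup maps onto Coqui TTS's VitsConfig. Field names are taken from Coqui's VITS config and may differ between versions; the authoritative values are in the config.json shipped with the model:

from TTS.tts.configs.vits_config import VitsConfig

# Approximate mapping of the training setup above onto Coqui's VitsConfig.
# Verify field names and values against the shipped config.json.
config = VitsConfig(
    batch_size=12,
    lr_gen=1e-4,                     # generator learning rate
    lr_disc=1e-4,                    # discriminator learning rate
    mixed_precision=True,            # FP16 training
    optimizer="AdamW",
    mel_loss_alpha=45.0,             # mel loss weight
    speaker_encoder_loss_alpha=9.0,  # speaker encoder loss weight
)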

Evaluation Results

The model was evaluated on a set of 30 Kanuri sentences by a human evaluator using the two criteria listed below.

Sample Evaluation Sentences:

  • "loktu nǝngriyi ye lan, nǝyama kulo ye dǝ so shawwa ro wurazen."
  • "nǝlewa nǝm dǝ, kunguna nǝm wa faidan kozǝna."
  • "na done hawar kattu ye so kǝla kurun nǝlewa ye tarzeyen so dǝa wane."

Evaluation Metrics:

  • Pronunciation Accuracy (1-5 scale): 2.8 average
  • Speech Naturalness (1-5 scale): 2.76 average

Usage

Installation

pip install coqui-tts torch scipy numpy huggingface_hub

Quick Start

from TTS.api import TTS
from huggingface_hub import hf_hub_download
import json
import tempfile
import scipy.io.wavfile as wavfile
import numpy as np
import os

# Download and setup model files
config_path = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "config.json")
with open(config_path, 'r') as f:
    config = json.load(f)

# Download required files
model_path = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "best_model_264313.pth")
speakers_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "speakers.pth")
language_ids_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "language_ids.json")
d_vector_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "d_vector.pth")
config_se_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "config_se.json")
model_se_file = hf_hub_download("CLEAR-Global/TWB-Voice-Kanuri-TTS-1.0", "model_se.pth")

# Update config paths
config["speakers_file"] = speakers_file
config["language_ids_file"] = language_ids_file
config["d_vector_file"] = [d_vector_file]
config["model_args"]["speakers_file"] = speakers_file
config["model_args"]["language_ids_file"] = language_ids_file
config["model_args"]["d_vector_file"] = [d_vector_file]
config["model_args"]["speaker_encoder_config_path"] = config_se_file
config["model_args"]["speaker_encoder_model_path"] = model_se_file

# Save updated config
temp_config = tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False)
json.dump(config, temp_config, indent=2)
temp_config.close()

# Load TTS model
tts = TTS(model_path=model_path, config_path=temp_config.name)
os.remove(temp_config.name)  # clean up the temporary config file

# Generate speech
text = "Loktu nǝngriyi ye lan, nǝyama kulo ye dǝ so shawwa ro wurazen."
speaker = "spk1"  # Options: spk1 (female)

wav = tts.synthesizer.tts(text=text.lower(), speaker_name=speaker)
wav_array = np.array(wav, dtype=np.float32)
wavfile.write("output.wav", tts.synthesizer.output_sample_rate, wav_array)

Batch Inference

For batch processing multiple sentences, use the provided inference script:

python3 inference.py /path/to/model.pth
# or
python3 inference.py /path/to/model/directory

The script will generate audio for all evaluation sentences with the available speaker.
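
If you prefer not to use the script, a minimal batch loop over the tts object from the Quick Start above works the same way; the sentence list and output naming here are illustrative:

import numpy as np
import scipy.io.wavfile as wavfile

# Reuses the `tts` object loaded in the Quick Start section.
sentences = [
    "loktu nǝngriyi ye lan, nǝyama kulo ye dǝ so shawwa ro wurazen.",
    "nǝlewa nǝm dǝ, kunguna nǝm wa faidan kozǝna.",
]

for i, sentence in enumerate(sentences):
    wav = tts.synthesizer.tts(text=sentence.lower(), speaker_name="spk1")
    wavfile.write(f"output_{i:03d}.wav", tts.synthesizer.output_sample_rate,
                  np.array(wav, dtype=np.float32))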

Model Limitations

  • Language: Only supports Kanuri language
  • Input Format: Requires lowercase text input
  • Speakers: Limited to 1 pre-trained speaker identity
  • Domain: Trained primarily on educational content and general speech
  • Code-switching: Not optimized for mixed language input

Technical Specifications

  • Input: Raw Kanuri text (UTF-8, lowercase)
  • Output: 24 kHz mono WAV audio
  • Inference Speed: ~0.1-0.5s per sentence (GPU)
  • Memory Requirements: ~2GB GPU memory for inference

Ethical Considerations

  • Consent: All training data used with appropriate permissions
  • Bias: Model reflects the speech patterns and characteristics of the specific speaker in training data
  • Use Cases: Intended for educational, accessibility, and content creation purposes
  • Non-Commercial: This model is released for non-commercial use only

Licensing

This model is released under the CC BY-NC license. For commercial licensing or other uses, please contact [email protected].

Citation

If you use this model in your research or applications, please cite:

@misc{yourtts-kanuri-2025,
  title={YourTTS Kanuri Single-Speaker Text-to-Speech Model},
  author={Alp Öktem},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/your-username/yourtts-kanuri-singlespeaker}
}

Acknowledgments

This model was created by CLEAR Global with support from the Patrick J. McGovern Foundation. We acknowledge the open source projects and resources that made it possible, in particular Coqui TTS and the CML-TTS multilingual checkpoint.

Model Card Authors

Alp Öktem ([email protected])
