
🎀 General-Purpose Swahili ASR Model (Open Datasets + Domain Data)


This automatic speech recognition (ASR) model is trained using open multilingual datasets and a multi-domain in-house dataset to provide high-accuracy transcription for clean, read-aloud Swahili speech.

It achieves strong generalization, maintaining benchmark accuracy while improving performance on real-world test data, and it performs well on both clean and noisy audio.

This model is part of a full ASR ablation study that analyzes how robust ASR models are to different modes and variations of data collection. 👉 View all models on GitHub

We are particularly interested in validating the conclusions we have drawn from our ablation studies:

  • Open-source datasets (like Common Voice and FLEURS) perform well on clean, benchmark speech.
  • Combining open and in-house data yields balanced results, but this depends on data quality and label accuracy.

While benchmark datasets like FLEURS are useful for comparison, they do not fully capture the variability and challenges of real-world speech, especially for underrepresented languages like Swahili. We invite the community to try out these models and help assess:

  1. How well the models perform on natural, conversational, or noisy audio
  2. Whether the improvements we have seen from combining diverse datasets generalize to your use case
  3. The gaps between benchmark results and real-world usability

A quick way to do this is to transcribe your own recordings and score them against reference transcripts, as in the sketch below.
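A minimal scoring sketch, assuming the jiwer package is installed, that pipe has been built as in the "How to Use" section below, and that my_audio.wav and reference.txt are hypothetical stand-ins for your own test files:

from jiwer import wer

# Transcribe a recording and compare it against a human reference transcript.
hypothesis = pipe("my_audio.wav")["text"]
reference = open("reference.txt", encoding="utf-8").read().strip()

# Word error rate: lower is better; the evaluation table below reports WER in percent.
print(f"WER: {100 * wer(reference, hypothesis):.2f}")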

Model

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.


🚀 How to Use

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from transformers import pipeline
from transformers.utils import is_flash_attn_2_available

device = "cuda" if torch.cuda.is_available() else "cpu"

# Use Flash Attention 2 when available, otherwise PyTorch's SDPA kernels.
attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained(
    "RafatK/Swahili-Whisper_Largev2-Decodis_Comb_FT",
    torch_dtype=torch.float16,  # FP16 weights; run on a GPU for best results
    attn_implementation=attn_implementation,
).to(device)

# Language and task are passed through generate_kwargs below, so clear the
# deprecated forced_decoder_ids to avoid a conflict.
model.generation_config.forced_decoder_ids = None

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=15,
    device=device,
    generate_kwargs={
        "num_beams": 5,
        "max_new_tokens": 440,
        "early_stopping": True,
        "repetition_penalty": 1.8,
        "language": "swahili",
        "task": "transcribe",
    },
)

text_output = pipe("audio.wav")["text"]
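Beam search (num_beams=5) with a repetition penalty helps curb the looping transcriptions Whisper can produce on long or noisy inputs, and chunk_length_s=15 lets the pipeline chunk audio longer than Whisper's 30-second window. The pipeline also accepts in-memory arrays, and recent transformers versions allow overriding generate_kwargs per call. A minimal sketch, assuming librosa is installed and audio.wav is your own file; note that translation quality after Swahili fine-tuning has not been evaluated here:

import librosa

# Load and resample to the model's expected 16 kHz rate (hypothetical local file).
audio, sr = librosa.load("audio.wav", sr=16000)

# Raw arrays are passed as a dict together with their sampling rate.
swahili_text = pipe({"array": audio, "sampling_rate": sr})["text"]

# Whisper is multitask: the same checkpoint can also translate speech to English.
english_text = pipe(
    {"array": audio, "sampling_rate": sr},
    generate_kwargs={"task": "translate"},
)["text"]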

📊 Total training data duration: ~400 hours


πŸ“ Languages: Swahili (sw)

πŸ‹οΈβ€β™‚οΈ Training Strategy

  • Architecture: whisper-large-v2
  • Framework: Whisper and Huggingface Transformers
  • Sampling rate: 16 kHz
  • Preprocessing: Volume normalization, High-Grade noise addition, Prosodic Augmentation, silence trimming
  • Learning Rate: 1e-5
  • Optimizer: Adamw_pytorch
  • Steps: 3000
  • Pretrained on open data
  • Fine-tuned on domain data
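The exact preprocessing and augmentation code is not published here; the sketch below is a minimal illustration of the listed steps, assuming librosa and numpy, with the noise and prosodic augmentations reduced to simple stand-ins (additive Gaussian noise and a random time-stretch):

import numpy as np
import librosa

def preprocess(path, target_sr=16000, top_db=30):
    # Load and resample to the 16 kHz training rate.
    y, sr = librosa.load(path, sr=target_sr)

    # Volume normalization: scale to a fixed peak amplitude.
    y = y / (np.max(np.abs(y)) + 1e-8)

    # Silence trimming: drop leading/trailing segments quieter than `top_db`.
    y, _ = librosa.effects.trim(y, top_db=top_db)
    return y

def augment(y, rng=np.random.default_rng(0)):
    # Noise addition (illustrative stand-in for the noise profile used in training).
    noisy = y + 0.005 * rng.standard_normal(len(y))

    # Prosodic augmentation, reduced here to a mild random tempo change.
    return librosa.effects.time_stretch(noisy, rate=rng.uniform(0.9, 1.1))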

📈 Evaluation Metric (WER, %)

Dataset                                   This Model   Whisper Large V2
FLEURS (benchmark)                        12.41        39.40
DECODIS test set (collected by DECODIS)   39.42        99.98
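The FLEURS row can be spot-checked with a sketch along these lines, assuming the datasets and evaluate packages are installed and pipe is built as in the "How to Use" section; "sw_ke" is the Swahili configuration of google/fleurs (older datasets versions may need trust_remote_code=True), and the text normalization behind the reported scores is an assumption here:

from datasets import load_dataset, Audio
import evaluate

wer_metric = evaluate.load("wer")

# A small slice of the FLEURS Swahili test split, resampled to 16 kHz.
ds = load_dataset("google/fleurs", "sw_ke", split="test[:50]")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

refs, hyps = [], []
for ex in ds:
    audio = ex["audio"]
    hyps.append(pipe({"array": audio["array"], "sampling_rate": audio["sampling_rate"]})["text"].lower())
    refs.append(ex["transcription"].lower())

# WER in percent, comparable in scale to the table above.
print(f"WER: {100 * wer_metric.compute(references=refs, predictions=hyps):.2f}")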

🎯 Intended Use

  • General-purpose transcription systems
  • Balanced performance on clean and noisy data
  • Speech interfaces in multilingual and informal settings

⚠️ Limitations

  • Slight trade-off in benchmark accuracy
  • May need additional domain data to handle extreme acoustic variation

πŸ“ Please try the models and share your feedback, issues, or results via:

GitHub Issues: Submit an issue

Hugging Face Discussions: Join the conversation

Your feedback will help us refine our dataset and improve ASR for underrepresented languages like Swahili and Yoruba.

