🧩 Robust ASR Model for Real-World Swahili Speech (Domain Data Only)
This ASR model is trained exclusively on 50 hours of real-world, domain-specific Swahili audio, including conversational and semi-spontaneous speech. It is designed to handle noisy environments, diverse speaker styles, and more natural linguistic variation, and it performs comparably well on clean, well-structured speech input.
This model is part of a full ASR ablation study that analyzes how ASR robustness depends on the training data and on different modes and variations of data collection. 🔗 View all models on GitHub
We are particularly interested in validating the conclusions we've observed through our ablation studies:
- Open-source datasets (like Common Voice & FLEURS) perform well on clean, benchmark speech.
- Combining open-source and domain-specific data yields balanced results, but the outcome depends on data quality and label accuracy.

While benchmark datasets like FLEURS are useful for comparison, they do not fully capture the variability and challenges of real-world speech, especially for underrepresented languages like Swahili and Yoruba. We are inviting the community to try out these models and help assess:
- How well the models perform on natural, conversational, or noisy audio
- Whether the improvements we've seen in combining diverse datasets generalize to your use case
- Gaps between benchmark results and real-world usability
Model
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
🚀 How to Use
```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline
from transformers.utils import is_flash_attn_2_available

device = "cuda" if torch.cuda.is_available() else "cpu"

# The fine-tuned checkpoint reuses the original Whisper tokenizer and feature extractor
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")

model = WhisperForConditionalGeneration.from_pretrained(
    "RafatK/Swahili-Whisper-Largev2-Decodis_FT",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
).to(device)

# Clear the legacy forced decoder ids so language/task can be set at generation time
model.generation_config.forced_decoder_ids = None

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=15,
    device=device,
    generate_kwargs={
        "num_beams": 5,
        "max_new_tokens": 440,
        "early_stopping": True,
        "repetition_penalty": 1.8,
        "language": "swahili",
        "task": "transcribe",
    },
)

text_output = pipe("audio.wav")["text"]
```
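The model expects 16 kHz input. If your recordings use a different sampling rate, you can resample before inference and pass the raw waveform directly to the pipeline. The snippet below is a minimal sketch using librosa; `audio.wav` is a placeholder path.

```python
import librosa

# Load the file and resample it to the 16 kHz rate the model was trained on
waveform, sampling_rate = librosa.load("audio.wav", sr=16000)

# The ASR pipeline also accepts a raw waveform together with its sampling rate
text_output = pipe({"raw": waveform, "sampling_rate": sampling_rate})["text"]
```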
📦 Training Data
- Custom real-world dataset
- Swahili
- Collected from real use cases (e.g. mobile recordings, community sources)
- ~50 hours
- Not publicly released (due to licensing)
🌍 Languages: Swahili (sw)
🏋️ Training Setup
- Architecture: whisper-large-v2
- Framework: Whisper and Hugging Face Transformers
- Sampling rate: 16 kHz
- Preprocessing: volume normalization, high-grade noise addition and filtering, prosodic augmentation, silence trimming
- Learning rate: 1e-5
- Optimizer: AdamW (PyTorch)
- Steps: 3000 (see the configuration sketch below)
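For orientation, the hyperparameters above map roughly onto a Hugging Face Seq2SeqTrainingArguments configuration like the sketch below. This is an illustrative reconstruction rather than the exact training script: the learning rate, optimizer, and step count come from the list above, while the batch size, warmup, and output path are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative configuration only: learning_rate, optim, and max_steps follow the
# card; the remaining values are assumed placeholders.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v2-swahili",  # hypothetical output path
    per_device_train_batch_size=8,            # assumption
    gradient_accumulation_steps=2,            # assumption
    learning_rate=1e-5,                       # from the card
    optim="adamw_torch",                      # AdamW, PyTorch implementation
    max_steps=3000,                           # from the card
    warmup_steps=300,                         # assumption
    fp16=True,                                # assumption (matches fp16 inference above)
    predict_with_generate=True,
)
```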
📊 Evaluation
| Dataset | This Model | Whisper Large V2 |
|---|---|---|
| FLEURS (benchmark) | 34.73 | 39.40 |
| Decodis Test Set (collected by DECODIS) | 46.44 | 99.98 |
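Lower is better in the table above; the scores are presumably word error rates, although the card does not name the metric explicitly. To run a comparable check on your own labeled audio, the sketch below uses the `evaluate` library; the reference transcript and audio path are placeholders.

```python
import evaluate

# Compute word error rate between reference transcripts and pipeline output
wer_metric = evaluate.load("wer")

references = ["habari ya asubuhi"]           # placeholder ground-truth transcript
predictions = [pipe("audio.wav")["text"]]    # transcription from the pipeline above

wer = 100 * wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2f}%")
```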
🎯 Intended Use
This model is best for:
- Noisy, real-world speech input
- Community-contributed or semi-structured conversation
- Language tools for low-resource environments
⚠️ Limitations
- Underperforms on clean benchmark datasets like FLEURS, mainly due to the size of the training set
- May exhibit bias toward some accents
- Limited by the smaller training size (~50h)
🙏 Please try the models and share your feedback, issues, or results via:
GitHub Issues: Submit an issue
Hugging Face Discussions: Join the conversation
Your feedback will help us refine our dataset and improve ASR for underrepresented languages like Swahili and Yoruba.