
🎀 General-Purpose Swahili ASR Model (Open Datasets + Domain Data)


This automatic speech recognition (ASR) model is trained using open multilingual datasets and a multi-domain in-house dataset to provide high-accuracy transcription for clean, read-aloud Swahili speech.

It achieves strong generalization, maintaining benchmark accuracy while improving performance on real-world test data, and it performs well on both clean and noisy audio.

This model is part of a full ASR ablation study that analyzes how robust ASR models are to different modes and variations of data collection. 👉 View all models on GitHub

We are particularly interested in validating the conclusions we have drawn from our ablation studies:

  • Open-source datasets (like Common Voice and FLEURS) perform well on clean, benchmark speech.
  • Combining open and in-house data yields balanced results, but this depends on data quality and label accuracy.

While benchmark datasets like FLEURS are useful for comparison, they do not fully capture the variability and challenges of real-world speech, especially for underrepresented languages like Swahili. We invite the community to try out these models and help assess:

  1. How well the models perform on natural, conversational, or noisy audio
  2. Whether the improvements we have seen from combining diverse datasets generalize to your use case
  3. The gaps between benchmark results and real-world usability

A quick way to do this is to transcribe your own recordings and score them against reference transcripts, as in the sketch below.
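A minimal scoring sketch, assuming the jiwer package is installed, that pipe has been built as in the "How to Use" section below, and that my_audio.wav and reference.txt are hypothetical stand-ins for your own test files:

from jiwer import wer

# Transcribe a recording and compare it against a human reference transcript.
hypothesis = pipe("my_audio.wav")["text"]
reference = open("reference.txt", encoding="utf-8").read().strip()

# Word error rate: lower is better; the evaluation table below reports WER in percent.
print(f"WER: {100 * wer(reference, hypothesis):.2f}")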

Model

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.


🚀 How to Use

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from transformers import pipeline
from transformers.utils import is_flash_attn_2_available

device = "cuda" if torch.cuda.is_available() else "cpu"

# Use Flash Attention 2 when available, otherwise PyTorch's SDPA kernels.
attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained(
    "RafatK/Swahili-Whisper_Largev2-Decodis_Comb_FT",
    torch_dtype=torch.float16,  # FP16 weights; run on a GPU for best results
    attn_implementation=attn_implementation,
).to(device)

# Language and task are passed through generate_kwargs below, so clear the
# deprecated forced_decoder_ids to avoid a conflict.
model.generation_config.forced_decoder_ids = None

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=15,
    device=device,
    generate_kwargs={
        "num_beams": 5,
        "max_new_tokens": 440,
        "early_stopping": True,
        "repetition_penalty": 1.8,
        "language": "swahili",
        "task": "transcribe",
    },
)

text_output = pipe("audio.wav")["text"]
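Beam search (num_beams=5) with a repetition penalty helps curb the looping transcriptions Whisper can produce on long or noisy inputs, and chunk_length_s=15 lets the pipeline chunk audio longer than Whisper's 30-second window. The pipeline also accepts in-memory arrays, and recent transformers versions allow overriding generate_kwargs per call. A minimal sketch, assuming librosa is installed and audio.wav is your own file; note that translation quality after Swahili fine-tuning has not been evaluated here:

import librosa

# Load and resample to the model's expected 16 kHz rate (hypothetical local file).
audio, sr = librosa.load("audio.wav", sr=16000)

# Raw arrays are passed as a dict together with their sampling rate.
swahili_text = pipe({"array": audio, "sampling_rate": sr})["text"]

# Whisper is multitask: the same checkpoint can also translate speech to English.
english_text = pipe(
    {"array": audio, "sampling_rate": sr},
    generate_kwargs={"task": "translate"},
)["text"]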

📊 Total training data duration: ~400 hours


πŸ“ Languages: Swahili (sw)

πŸ‹οΈβ€β™‚οΈ Training Strategy

  • Architecture: whisper-large-v2
  • Framework: Whisper and Huggingface Transformers
  • Sampling rate: 16 kHz
  • Preprocessing: Volume normalization, High-Grade noise addition, Prosodic Augmentation, silence trimming
  • Learning Rate: 1e-5
  • Optimizer: Adamw_pytorch
  • Steps: 3000
  • Pretrained on open data
  • Fine-tuned on domain data
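The exact preprocessing and augmentation code is not published here; the sketch below is a minimal illustration of the listed steps, assuming librosa and numpy, with the noise and prosodic augmentations reduced to simple stand-ins (additive Gaussian noise and a random time-stretch):

import numpy as np
import librosa

def preprocess(path, target_sr=16000, top_db=30):
    # Load and resample to the 16 kHz training rate.
    y, sr = librosa.load(path, sr=target_sr)

    # Volume normalization: scale to a fixed peak amplitude.
    y = y / (np.max(np.abs(y)) + 1e-8)

    # Silence trimming: drop leading/trailing segments quieter than `top_db`.
    y, _ = librosa.effects.trim(y, top_db=top_db)
    return y

def augment(y, rng=np.random.default_rng(0)):
    # Noise addition (illustrative stand-in for the noise profile used in training).
    noisy = y + 0.005 * rng.standard_normal(len(y))

    # Prosodic augmentation, reduced here to a mild random tempo change.
    return librosa.effects.time_stretch(noisy, rate=rng.uniform(0.9, 1.1))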

📈 Evaluation Metric (WER, %)

Dataset                                   This Model   Whisper Large V2
FLEURS (benchmark)                        12.41        39.40
DECODIS test set (collected by DECODIS)   39.42        99.98
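The FLEURS row can be spot-checked with a sketch along these lines, assuming the datasets and evaluate packages are installed and pipe is built as in the "How to Use" section; "sw_ke" is the Swahili configuration of google/fleurs (older datasets versions may need trust_remote_code=True), and the text normalization behind the reported scores is an assumption here:

from datasets import load_dataset, Audio
import evaluate

wer_metric = evaluate.load("wer")

# A small slice of the FLEURS Swahili test split, resampled to 16 kHz.
ds = load_dataset("google/fleurs", "sw_ke", split="test[:50]")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

refs, hyps = [], []
for ex in ds:
    audio = ex["audio"]
    hyps.append(pipe({"array": audio["array"], "sampling_rate": audio["sampling_rate"]})["text"].lower())
    refs.append(ex["transcription"].lower())

# WER in percent, comparable in scale to the table above.
print(f"WER: {100 * wer_metric.compute(references=refs, predictions=hyps):.2f}")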

🎯 Intended Use

  • General-purpose transcription systems
  • Balanced performance on clean and noisy data
  • Speech interfaces in multilingual and informal settings

⚠️ Limitations

  • Slight trade-off in benchmark accuracy
  • May need additional domain data to handle extreme acoustic variation

πŸ“ Please try the models and share your feedback, issues, or results via:

GitHub Issues: Submit an issue

Hugging Face Discussions: Join the conversation

Your feedback will help us refine our dataset and improve ASR for underrepresented languages like Swahili and Yoruba.

