🧩 Robust ASR Model for Real-World Swahili Speech (Domain Data Only)
This ASR model is trained exclusively on 50 hours of real-world, domain-specific Swahili audio, including conversational and semi-spontaneous speech. It is designed to handle noisy environments, diverse speaker styles, and more natural linguistic variation, and it performs comparably well on clean, well-structured speech input.
This model is part of a full ASR ablation study that analyzes how ASR robustness depends on the training data and on different modes and variations of data collection. 🔗 View all models on GitHub
We are particularly interested in validating the conclusions we've observed through our ablation studies:
- Open-source datasets (like Common Voice & FLEURS) perform well on clean, benchmark speech.
- Combining open-source and domain-specific data yields balanced results, but the outcome depends on data quality and label accuracy.

While benchmark datasets like FLEURS are useful for comparison, they do not fully capture the variability and challenges of real-world speech, especially for underrepresented languages like Swahili and Yoruba. We are inviting the community to try out these models and help assess:
- How well the models perform on natural, conversational, or noisy audio
- Whether the improvements we've seen in combining diverse datasets generalize to your use case
- Gaps between benchmark results and real-world usability
Model
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
🚀 How to Use
```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline
from transformers.utils import is_flash_attn_2_available

device = "cuda" if torch.cuda.is_available() else "cpu"

# The fine-tuned checkpoint reuses the original Whisper tokenizer and feature extractor
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")

model = WhisperForConditionalGeneration.from_pretrained(
    "RafatK/Swahili-Whisper-Largev2-Decodis_FT",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
).to(device)

# Clear the legacy forced decoder ids so language/task can be set at generation time
model.generation_config.forced_decoder_ids = None

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=15,
    device=device,
    generate_kwargs={
        "num_beams": 5,
        "max_new_tokens": 440,
        "early_stopping": True,
        "repetition_penalty": 1.8,
        "language": "swahili",
        "task": "transcribe",
    },
)

text_output = pipe("audio.wav")["text"]
```
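The model expects 16 kHz input. If your recordings use a different sampling rate, you can resample before inference and pass the raw waveform directly to the pipeline. The snippet below is a minimal sketch using librosa; `audio.wav` is a placeholder path.

```python
import librosa

# Load the file and resample it to the 16 kHz rate the model was trained on
waveform, sampling_rate = librosa.load("audio.wav", sr=16000)

# The ASR pipeline also accepts a raw waveform together with its sampling rate
text_output = pipe({"raw": waveform, "sampling_rate": sampling_rate})["text"]
```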
📦 Training Data
- Custom real-world dataset
- Swahili
- Collected from real use cases (e.g. mobile recordings, community sources)
- ~50 hours
- Not publicly released (due to licensing)
🌍 Languages: Swahili (sw)
🏋️ Training Setup
- Architecture: whisper-large-v2
- Framework: Whisper and Hugging Face Transformers
- Sampling rate: 16 kHz
- Preprocessing: volume normalization, high-grade noise addition and filtering, prosodic augmentation, silence trimming
- Learning rate: 1e-5
- Optimizer: AdamW (PyTorch)
- Steps: 3000 (see the configuration sketch below)
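For orientation, the hyperparameters above map roughly onto a Hugging Face Seq2SeqTrainingArguments configuration like the sketch below. This is an illustrative reconstruction rather than the exact training script: the learning rate, optimizer, and step count come from the list above, while the batch size, warmup, and output path are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative configuration only: learning_rate, optim, and max_steps follow the
# card; the remaining values are assumed placeholders.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v2-swahili",  # hypothetical output path
    per_device_train_batch_size=8,            # assumption
    gradient_accumulation_steps=2,            # assumption
    learning_rate=1e-5,                       # from the card
    optim="adamw_torch",                      # AdamW, PyTorch implementation
    max_steps=3000,                           # from the card
    warmup_steps=300,                         # assumption
    fp16=True,                                # assumption (matches fp16 inference above)
    predict_with_generate=True,
)
```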
📊 Evaluation
| Dataset | This Model | Whisper Large V2 |
|---|---|---|
| FLEURS (benchmark) | 34.73 | 39.40 |
| Decodis Test Set (collected by DECODIS) | 46.44 | 99.98 |
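Lower is better in the table above; the scores are presumably word error rates, although the card does not name the metric explicitly. To run a comparable check on your own labeled audio, the sketch below uses the `evaluate` library; the reference transcript and audio path are placeholders.

```python
import evaluate

# Compute word error rate between reference transcripts and pipeline output
wer_metric = evaluate.load("wer")

references = ["habari ya asubuhi"]           # placeholder ground-truth transcript
predictions = [pipe("audio.wav")["text"]]    # transcription from the pipeline above

wer = 100 * wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2f}%")
```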
🎯 Intended Use
This model is best for:
- Noisy, real-world speech input
- Community-contributed or semi-structured conversation
- Language tools for low-resource environments
⚠️ Limitations
- Underperforms on clean benchmark datasets like FLEURS, mainly due to the size of the training set
- May exhibit bias toward some accents
- Limited by the smaller training size (~50h)
🙏 Please try the models and share your feedback, issues, or results via:
GitHub Issues: Submit an issue
Hugging Face Discussions: Join the conversation
Your feedback will help us refine our dataset and improve ASR for underrepresented languages like Swahili and Yoruba.