---
language:
- sw
metrics:
- wer
base_model:
- openai/whisper-large-v2
pipeline_tag: automatic-speech-recognition
---


# 🎤 General-Purpose Swahili ASR Model (Open Datasets + Domain Data)


This automatic speech recognition (ASR) model is trained on open multilingual datasets together with a multi-domain in-house dataset to provide **high-accuracy transcription** of clean, read-aloud **Swahili** speech. It achieves **strong generalization**, maintaining benchmark accuracy while improving performance on real-world test data, and performs well on both clean and noisy audio.

This model is part of a full ASR ablation study that analyzes the robustness of different training data sources across different modes and variations of data collection.

👉 View all models on [GitHub](https://github.com/Rafat-decodis/Robust_Swahili_ASR)

**We are particularly interested in validating the conclusions we've observed through our ablation studies**: while benchmark datasets like FLEURS are useful for comparison, they do not fully capture the variability and challenges of real-world speech, especially for underrepresented languages like Swahili.

We are inviting the community to try out these models and help assess:

1. How well the models perform on natural, conversational, or noisy audio
2. Whether open-source datasets (like Common Voice & FLEURS) perform well on clean, benchmark speech
3. Whether the improvements we have seen from combining diverse datasets generalize to your use case
4. Gaps between benchmark results and real-world usability
5. Whether a combination of both yields balanced results, and how much this depends on data quality and label accuracy

## Model

[Whisper](https://github.com/openai/whisper) is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.

---

## 🚀 How to Use

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline
from transformers.utils import is_flash_attn_2_available

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained(
    "RafatK/Swahili-Whisper_Largev2-Decodis_Comb_FT",
    torch_dtype=torch_dtype,
    attn_implementation=attn_implementation,
).to(device)

# Language and task are passed at generation time, so the forced decoder ids
# inherited from the base checkpoint are cleared here.
model.generation_config.forced_decoder_ids = None

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=15,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "num_beams": 5,
        "max_new_tokens": 440,
        "early_stopping": True,
        "repetition_penalty": 1.8,
        "language": "swahili",
        "task": "transcribe",
    },
)

text_output = pipe("audio.wav")["text"]
```

---

📊 **Total Duration**: ~400 hours

---

📁 **Languages**: Swahili (`sw`)

---

## 🏋️‍♂️ Training Strategy

- Architecture: `whisper-large-v2`
- Framework: Whisper and Hugging Face Transformers
- Sampling rate: 16 kHz
- Preprocessing: volume normalization, high-grade noise addition, prosodic augmentation, silence trimming
- Learning rate: 1e-5
- Optimizer: AdamW (PyTorch)
- Steps: 3000
- Pretrained on open data
- Fine-tuned on domain data
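This card does not ship the training script, but the hyperparameters above map naturally onto 🤗 Transformers' `Seq2SeqTrainingArguments`. The snippet below is a minimal, hypothetical sketch under those settings; the output path, batch size, gradient accumulation, and warmup values are assumptions and are not stated in this card.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: learning rate, step count, and optimizer come from the list
# above; the remaining values are assumptions, not taken from this card.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v2-swahili",  # hypothetical output path
    learning_rate=1e-5,                       # learning rate from the card
    max_steps=3000,                           # training steps from the card
    optim="adamw_torch",                      # AdamW (PyTorch)
    per_device_train_batch_size=8,            # assumption
    gradient_accumulation_steps=2,            # assumption
    warmup_steps=200,                         # assumption
    fp16=True,
    predict_with_generate=True,
    logging_steps=50,
    report_to="none",
)
# A Seq2SeqTrainer would then be built from this config together with the
# model, a Whisper padding data collator, and 16 kHz train/eval datasets
# (not shown here).
```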
---

## 📈 Evaluation Metric (WER)

| Dataset | This Model | Whisper Large V2 |
|---------|------------|------------------|
| **FLEURS (benchmark)** | **12.41** | **39.40** |
| **[Decodis Test Set](https://huggingface.co/datasets/RafatK/Decodis_Test_Set) (collected by DECODIS)** | **39.42** | **99.98** |

---

## 🎯 Intended Use

- General-purpose transcription systems
- Balanced performance on clean and noisy data
- Speech interfaces in multilingual and informal settings

---

## ⚠️ Limitations

- Slight trade-off in benchmark precision
- May need more domain data for extreme acoustic variation

---

📝 Please try the models and share your feedback, issues, or results via:

- GitHub Issues: Submit an issue
- Hugging Face Discussions: Join the conversation

Your feedback will help us refine our dataset and improve ASR for underrepresented languages like Swahili and Yoruba.

---
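If you would like to include WER numbers with your feedback, the following is a minimal sketch using the `evaluate` library. The audio path and reference transcript are placeholders, and `pipe` refers to the pipeline built in the "How to Use" section above.

```python
import evaluate

# Word error rate between a reference transcript and the model output.
# "audio.wav" and the reference string are placeholders for your own data;
# `pipe` is the ASR pipeline constructed in the usage example above.
wer_metric = evaluate.load("wer")

reference = "habari ya asubuhi"          # your ground-truth transcript (placeholder)
prediction = pipe("audio.wav")["text"]   # model transcription

wer = wer_metric.compute(references=[reference], predictions=[prediction])
print(f"WER: {wer * 100:.2f}%")
```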