---
language:
- yo
metrics:
- wer
base_model:
- openai/whisper-large-v2
pipeline_tag: automatic-speech-recognition
---

# 🎤 General-Purpose Yoruba ASR Model (Open Datasets + Domain Data)

This automatic speech recognition (ASR) model is trained on open multilingual datasets and a multi-domain in-house dataset to provide **high-accuracy transcription** of clean, read-aloud **Yoruba** speech. It achieves **strong generalization**, maintaining benchmark accuracy while improving performance on real-world test data. The model is designed to handle both clean and noisy audio.

This model is part of a full ASR ablation study that examines how robust the models are to different data sources, recording modes, and collection conditions.

👉 View all models on [GitHub](https://github.com/Rafat-decodis/Robust-ASR-for-Low-Resource-Languages)

**We are particularly interested in validating the conclusions we’ve observed through our ablation studies**: while benchmark datasets like FLEURS are useful for comparison, they do not fully capture the variability and challenges of real-world speech, especially for underrepresented languages like Yoruba.

We invite the community to try out these models and help assess:

1. How well the models perform on natural, conversational, or noisy audio
2. Whether open-source datasets (like Common Voice and FLEURS) perform well mainly on clean, benchmark speech
3. Whether the improvements we've seen from combining diverse datasets generalize to your use case
4. Gaps between benchmark results and real-world usability
5. Whether combining open and domain data yields balanced results, and how much that depends on data quality and label accuracy

## Model

[Whisper](https://github.com/openai/whisper) is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is a multitask model that can perform multilingual speech recognition, speech translation, and language identification.
---

## 🚀 How to Use

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline
from transformers.utils import is_flash_attn_2_available

device = "cuda"  # the model is loaded in float16, so a GPU is assumed

# Use FlashAttention-2 when it is installed, otherwise fall back to PyTorch SDPA.
attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained(
    "RafatK/Whisper_Largev2-Yoruba-Decodis_Comb_FT",
    torch_dtype=torch.float16,
    attn_implementation=attn_implementation,
).to(device)

# Keep a copy of the forced decoder ids, then clear them so that the
# `language` and `task` settings passed at generation time take effect.
model.generation_config.input_ids = model.generation_config.forced_decoder_ids
model.generation_config.forced_decoder_ids = None

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=15,
    device=device,
    torch_dtype=torch.float16,  # match the half-precision model weights
    generate_kwargs={
        "num_beams": 5,
        "max_new_tokens": 440,
        "early_stopping": True,
        # Kept as given in the original card. Whisper also has a Yoruba
        # language token; check which one this checkpoint expects.
        "language": "english",
        "task": "transcribe",
    },
)

text_output = pipe("audio.wav")["text"]
```

📊 **Total Duration**: ~45 hours

---

📁 **Languages**: Yoruba (`yo`)

---

## 🏋️‍♂️ Training Strategy

- Architecture: `whisper-large-v2`
- Framework: Whisper and Hugging Face Transformers
- Sampling rate: 16 kHz
- Preprocessing: volume normalization, high-grade noise addition, prosodic augmentation, silence trimming
- Learning rate: 1e-5
- Optimizer: AdamW (PyTorch implementation)
- Steps: 3000
- Pretrained on open data
- Fine-tuned on domain data

---

## 📈 Evaluation Metric (WER)

| Dataset                | This Model | Whisper Large V2 |
|------------------------|------------|------------------|
| **FLEURS (benchmark)** | **25.44**  | **No-Info**      |
| **Our test set**       | **62.05**  | **No-Info**      |

---

## 🎯 Intended Use

- General-purpose transcription systems
- Balanced performance on clean and noisy data
- Speech interfaces in multilingual and informal settings

---

## ⚠️ Limitations

- Slight trade-off in benchmark accuracy
- May need more domain data for extreme acoustic variation

---

📝 Please try the models and share your feedback, issues, or results via:

- GitHub Issues: Submit an issue
- Hugging Face Discussions: Join the conversation

Your feedback will help us refine our dataset and improve ASR for underrepresented languages like Swahili and Yoruba.

---
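If you plan to report results on your own recordings, the sketch below shows one way to compute word error rate (WER), the metric used in the evaluation table. It is a minimal illustration and not necessarily the scoring used for the numbers above: the function name `word_error_rate`, the lowercase-and-split-on-whitespace normalization, and the choice to return a percentage are all assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Placeholder tokens: one substitution plus one deletion over four words.
score = word_error_rate("a b c d", "a x c")
```

Here `score` covers a single utterance. When scoring a whole test set, it is common to sum the edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance scores.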