
🧩 Robust ASR Model for Real-World Yoruba Speech (Domain Data Only)


This ASR model is trained exclusively on 50 hours of real-world, domain-specific Yoruba audio, including conversational and semi-spontaneous speech. It is designed to handle noisy environments, diverse speaker styles, and natural linguistic variation, and it performs comparably well on clean, well-structured speech input.

This model is part of a full ASR ablation study analyzing how model robustness varies across different modes and variations of data collection. 👉 View all models on GitHub

We are particularly interested in validating the conclusions we've observed in our ablation studies:

  • Open-source datasets (like Common Voice and FLEURS) perform well on clean, benchmark speech.
  • Combining open-source and real-world data yields balanced results, but depends on data quality and label accuracy.

While benchmark datasets like FLEURS are useful for comparison, they do not fully capture the variability and challenges of real-world speech, especially for underrepresented languages like Swahili and Yoruba. We are inviting the community to try out these models and help assess:

  1. How well the models perform on natural, conversational, or noisy audio
  2. Whether the improvements we've seen from combining diverse datasets generalize to your use case
  3. Gaps between benchmark results and real-world usability

Model

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.


🚀 How to Use

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline
from transformers.utils import is_flash_attn_2_available

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained(
    "RafatK/Whisper_Largev2-Yoruba-Decodis_FT", torch_dtype=torch_dtype
).to(device)
# Recent transformers versions reject forced_decoder_ids at generation time;
# clear it and pass language/task through generate_kwargs instead.
model.generation_config.forced_decoder_ids = None

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=15,
    device=device,
    model_kwargs={"attn_implementation": "flash_attention_2"}
    if is_flash_attn_2_available()
    else {"attn_implementation": "sdpa"},
    generate_kwargs={
        "num_beams": 5,
        "max_new_tokens": 440,
        "early_stopping": True,
        "language": "yoruba",
        "task": "transcribe",
    },
)
text_output = pipe("audio.wav")["text"]
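The pipeline decodes file paths with ffmpeg and resamples them automatically, but Whisper's feature extractor expects 16 kHz mono audio, so raw NumPy arrays at another rate must be resampled first. A rough linear-interpolation sketch (for quality, a band-limited resampler such as torchaudio's or librosa's is preferable):

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Naive linear-interpolation resampler (illustrative sketch only)."""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    old_t = np.arange(len(audio)) / orig_sr   # original sample times (seconds)
    new_t = np.arange(n_out) / target_sr      # target sample times
    return np.interp(new_t, old_t, audio).astype(np.float32)

# One second of 44.1 kHz audio becomes 16 000 samples at 16 kHz
y = resample_linear(np.zeros(44_100, dtype=np.float32), 44_100)
```

The resampled array can then be passed to the pipeline as `{"raw": y, "sampling_rate": 16_000}`.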

📦 Training Data

  • Custom real-world Yoruba dataset
  • Collected from real use cases (e.g. mobile recordings, community sources)
  • ~23 hours
  • Not publicly released (due to licensing)

πŸ“ Languages: Yoruba (yo)


πŸ‹οΈβ€β™‚οΈ Training Setup

  • Architecture: whisper-large-v2
  • Framework: Whisper and Hugging Face Transformers
  • Sampling rate: 16 kHz
  • Preprocessing: volume normalization, high-grade noise addition and filtering, prosodic augmentation, silence trimming
  • Learning rate: 1e-5
  • Optimizer: AdamW (PyTorch)
  • Steps: 3000
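The exact preprocessing pipeline is not released; as an illustration, the volume normalization and silence trimming steps might look like the following sketch (the RMS threshold and frame length here are assumed values, not the ones used in training):

```python
import numpy as np

def peak_normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale so the loudest sample sits at `peak` (no-op on pure silence)."""
    m = np.max(np.abs(audio))
    return audio if m == 0 else audio * (peak / m)

def trim_silence(audio: np.ndarray, sr: int = 16_000,
                 frame_ms: int = 20, threshold: float = 0.01) -> np.ndarray:
    """Drop leading/trailing frames whose RMS energy falls below `threshold`."""
    frame = int(sr * frame_ms / 1000)
    n = len(audio) // frame
    rms = np.sqrt(np.mean(audio[:n * frame].reshape(n, frame) ** 2, axis=1))
    keep = np.where(rms >= threshold)[0]
    if len(keep) == 0:
        return audio[:0]                       # everything was silence
    return audio[keep[0] * frame:(keep[-1] + 1) * frame]
```

In practice a VAD-based trimmer (e.g. WebRTC VAD or Silero VAD) is more robust for noisy field recordings than a fixed energy threshold.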

📈 Evaluation

Dataset              This Model   Whisper Large V2
FLEURS (benchmark)   66.22        n/a
Our test set         64.51        n/a
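The scores above are presumably word error rates (WER, lower is better). To evaluate the model on your own recordings, WER is the word-level edit distance between reference and hypothesis divided by the reference length; a minimal pure-Python sketch (libraries such as jiwer do this with normalization built in):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between the current ref prefix and hyp[:j]
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,               # deletion
                d[j - 1] + 1,           # insertion
                prev_diag + (r != h),   # substitution (free if words match)
            )
    return d[-1] / max(len(ref), 1)
```

Lower-casing and punctuation stripping are usually applied to both strings before scoring, so check which normalization a reported number assumes before comparing.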

🎯 Intended Use

This model is best for:

  • Noisy, real-world speech input
  • Community-contributed or semi-structured conversation
  • Language tools for low-resource environments

⚠️ Limitations

  • Underperforms on clean, read-speech benchmarks like FLEURS, mainly due to the limited training set size (~23 hours)
  • May exhibit bias toward some accents

πŸ“ Please try the models and share your feedback, issues, or results via:

GitHub Issues: Submit an issue

Hugging Face Discussions: Join the conversation

Your feedback will help us refine our dataset and improve ASR for underrepresented languages like Swahili and Yoruba.

