General-Purpose Yoruba ASR Model (Open Datasets + Domain Data)
This automatic speech recognition (ASR) model is trained on open multilingual datasets and a multi-domain in-house dataset to provide high-accuracy transcription of clean, read-aloud Yoruba speech.
It generalizes well, maintaining benchmark accuracy while improving performance on real-world test data, and performs well on both clean and noisy audio.
This model is part of a full ASR ablation study that examines how robust ASR performance is to different data sources and data-collection conditions. View all models on GitHub
We are particularly interested in validating the conclusions we've observed through our ablation studies:
- Open-source datasets (like Common Voice and FLEURS) perform well on clean, benchmark speech.
- A combination of open-source and in-house data yields balanced results, but depends on data quality and label accuracy.

While benchmark datasets like FLEURS are useful for comparison, they do not fully capture the variability and challenges of real-world speech, especially for underrepresented languages like Yoruba. We invite the community to try out these models and help assess:
- How well the models perform on natural, conversational, or noisy audio
- Whether the improvements we've seen from combining diverse datasets generalize to your use case
- Gaps between benchmark results and real-world usability
Model
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
How to Use
```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline
from transformers.utils import is_flash_attn_2_available

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the processor from the base checkpoint and the fine-tuned Yoruba weights
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained(
    "RafatK/Whisper_Largev2-Yoruba-Decodis_Comb_FT",
    torch_dtype=dtype,
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
).to(device)

# Clear forced decoder ids so language/task can be supplied through generate_kwargs
model.generation_config.forced_decoder_ids = None

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=15,
    device=device,
    generate_kwargs={
        "num_beams": 5,
        "max_new_tokens": 440,
        "early_stopping": True,
        "language": "yoruba",
        "task": "transcribe",
    },
)

text_output = pipe("audio.wav")["text"]
print(text_output)
```
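If your recording is not already 16 kHz mono (the sampling rate listed under Training Strategy below), you can resample it before transcription. This is a minimal sketch assuming `librosa` is installed; `example.wav` is a placeholder path.

```python
import librosa

# Resample to 16 kHz mono before transcription; "example.wav" is a placeholder path.
speech, sr = librosa.load("example.wav", sr=16000, mono=True)

# The ASR pipeline also accepts a raw waveform plus its sampling rate.
text_output = pipe({"raw": speech, "sampling_rate": sr})["text"]
```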
Total Duration: ~45 hours
Languages: Yoruba (yo)
Training Strategy
- Architecture: whisper-large-v2
- Framework: Whisper and Hugging Face Transformers
- Sampling rate: 16 kHz
- Preprocessing: volume normalization, high-grade noise addition, prosodic augmentation, silence trimming
- Learning rate: 1e-5
- Optimizer: AdamW (PyTorch)
- Steps: 3000 (see the configuration sketch below)
- Pretrained on open data
- Fine-tuned on domain data
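As a rough illustration, the hyperparameters above could be expressed with Hugging Face `Seq2SeqTrainingArguments` as follows. This is a sketch rather than the exact training script: dataset preparation, the data collator, and metric computation are omitted, and values not listed above (output directory, batch size, warmup) are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the listed hyperparameters; unlisted values below are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v2-yoruba-ft",  # illustrative path
    learning_rate=1e-5,              # as listed above
    max_steps=3000,                  # as listed above
    optim="adamw_torch",             # AdamW (PyTorch), as listed above
    per_device_train_batch_size=8,   # assumption, not reported
    warmup_steps=500,                # assumption, not reported
    fp16=True,                       # assumption, common for whisper-large-v2
    predict_with_generate=True,
)

# A Seq2SeqTrainer would then be built with this config plus the (omitted)
# train/eval datasets, data collator, and compute_metrics function.
```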
Evaluation Metric: Word Error Rate (WER, %)
| Dataset            | This Model | Whisper Large V2 |
|--------------------|------------|------------------|
| FLEURS (benchmark) | 25.44      | Not reported     |
| Our test set       | 62.05      | Not reported     |
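WER scores like those above can be computed with the Hugging Face `evaluate` library. The sketch below uses placeholder reference transcripts and the `pipe` object from the usage example; in practice, references and predictions come from your own test set.

```python
import evaluate

wer_metric = evaluate.load("wer")

# Placeholder reference/hypothesis pairs; replace with your test-set transcripts.
references = ["ẹ káàárọ̀"]
predictions = [pipe("audio.wav")["text"]]

wer = 100 * wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2f}%")
```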
Intended Use
- General-purpose transcription systems
- Balanced performance on clean and noisy data
- Speech interfaces in multilingual and informal settings
Limitations
- Slight trade-off in benchmark accuracy
- May need more domain data for extreme acoustic variation
Please try the models and share your feedback, issues, or results via:
GitHub Issues: Submit an issue
Hugging Face Discussions: Join the conversation
Your feedback will help us refine our dataset and improve ASR for underrepresented languages like Swahili and Yoruba.