Whisper Large v3 Turbo (Kinyarwanda)
Whisper is an automatic speech recognition (ASR) system developed by OpenAI. It can transcribe and translate spoken language into text with high accuracy, supporting multiple languages, accents, and noisy environments. It is designed for general-purpose speech processing and can handle various audio inputs.
Whisper-large-v3-turbo is an optimized version of OpenAI's Whisper-large-v3 model, designed to enhance transcription speed while maintaining high accuracy. This optimization is achieved by reducing the number of decoder layers from 32 to 4, resulting in a model that is significantly faster with only a minor decrease in transcription quality.

Fine-tuning
I fine-tuned the Whisper-large-v3-turbo model on the Kinyarwanda ASR Track A dataset, which consists of over 90,000 audio files of 10 to 40 seconds each, every file paired with its text transcription.
Before fine-tuning, the recordings, originally encoded with the Opus codec and stored in WebM (Matroska) containers at a 48,000 Hz sample rate, were converted to .wav files at a 16,000 Hz sample rate to match the model's input requirements.
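The card does not say which tool performed the conversion. As a minimal sketch, assuming ffmpeg is installed and used for the job, the command for one file could be built as below. The function name, the mono down-mix, and the 16-bit PCM encoding are illustrative assumptions, not details from the original pipeline.

```python
from pathlib import Path

TARGET_SAMPLE_RATE = 16_000  # Hz, the rate the model expects


def build_ffmpeg_command(src, dst_dir):
    """Return the ffmpeg argument list that converts one recording to 16 kHz WAV.

    Hypothetical helper: only the 48 kHz WebM/Opus input and the 16 kHz .wav
    output come from the text above; everything else is an assumption.
    """
    src = Path(src)
    dst = Path(dst_dir) / (src.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                            # overwrite an existing output file
        "-i", str(src),                  # input: Opus audio in a WebM container
        "-ac", "1",                      # down-mix to one channel (assumption)
        "-ar", str(TARGET_SAMPLE_RATE),  # resample 48 kHz -> 16 kHz
        "-c:a", "pcm_s16le",             # 16-bit PCM samples (assumption)
        str(dst),
    ]


# To run the conversion for real (requires ffmpeg on PATH):
#   import subprocess
#   subprocess.run(build_ffmpeg_command("clips/sample.webm", "wav"), check=True)
```

Building the argument list separately from running it keeps the command easy to inspect before converting tens of thousands of files.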
Configuration
- Trainable layers = encoder: 15 (progressively unfrozen, 2 layers every 2 epochs); decoder: 4
- Learning rate = 7e-6
- Batch size = 2 (for both dataloaders)
- Gradient accumulation steps = 8
- Optimizer = AdamW
- Weight decay = 0.1
- Epochs = 10
- Scheduler = Linear (with warmup = 0.05)
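To make the batch and scheduler settings concrete, here is a small sketch in plain Python. It assumes that "warmup = 0.05" means 5% of the total optimizer steps, and it reproduces the usual linear warmup followed by linear decay to zero. The function name and the step counts in the usage lines are illustrative only.

```python
BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 8
# Samples contributing to each optimizer update.
EFFECTIVE_BATCH = BATCH_SIZE * GRAD_ACCUM_STEPS  # 2 * 8 = 16


def linear_warmup_lr(step, total_steps, base_lr=7e-6, warmup_ratio=0.05):
    """Learning rate at a given optimizer step (hypothetical helper).

    Rises linearly from 0 to base_lr over the warmup steps, then falls
    linearly back to 0 at total_steps.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Illustrative run length; the real number of steps depends on the dataset split.
for step in (0, 25, 50, 525, 1000):
    print(step, linear_warmup_lr(step, total_steps=1000))
```

With a per-device batch of 2, gradient accumulation is what brings each weight update up to 16 samples.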
Dropout:
- Encoder = 0.3 if idx == 20 else 0.2 if idx in [21, 22, 29, 30] else 0.0
- Decoder = 0.3 if idx == 1 else 0.1
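The per-layer dropout rule above can be written as two small functions. This is a sketch that assumes `idx` is the zero-based layer index within the 32 encoder layers and 4 decoder layers of Whisper-large-v3-turbo. The function names are illustrative, and how the rates were attached to the model is not stated in this card.

```python
NUM_ENCODER_LAYERS = 32
NUM_DECODER_LAYERS = 4


def encoder_dropout(idx):
    """Dropout rate for encoder layer idx (assumed zero-based)."""
    if idx == 20:
        return 0.3
    if idx in (21, 22, 29, 30):
        return 0.2
    return 0.0


def decoder_dropout(idx):
    """Dropout rate for decoder layer idx (assumed zero-based)."""
    return 0.3 if idx == 1 else 0.1


encoder_rates = [encoder_dropout(i) for i in range(NUM_ENCODER_LAYERS)]
decoder_rates = [decoder_dropout(i) for i in range(NUM_DECODER_LAYERS)]
print(decoder_rates)
```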
Early Stopping: patience=3, min_delta=0.0005
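A minimal sketch of early stopping with these two parameters follows. The card does not say which metric was monitored, so the sketch assumes a single value where lower is better, such as the evaluation loss. The class name and the sample values are illustrative.

```python
class EarlyStopping:
    """Stop when the metric has not improved by min_delta for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0005):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, value):
        """Record one epoch's metric; return True when training should stop."""
        if value < self.best - self.min_delta:
            # Improvement larger than min_delta: remember it, reset the counter.
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping()
# Made-up metric values, one per epoch.
decisions = [stopper.step(v) for v in (1.0, 0.9, 0.8999, 0.8998, 0.8997)]
print(decisions)
```

The last three values each improve by less than `min_delta`, so they count as three epochs without progress.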
The condition for saving the model is that the test loss, Word Error Rate (WER), and Character Error Rate (CER) must be lower than the previously recorded best values.
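The saving rule can be expressed as a single check: all three metrics must improve at once. The dictionary keys and the function name below are illustrative, and the numbers in the usage lines are made up.

```python
METRICS = ("loss", "wer", "cer")


def should_save(current, best):
    """True only if loss, WER and CER are all strictly lower than the best so far."""
    return all(current[name] < best[name] for name in METRICS)


best = {name: float("inf") for name in METRICS}
epoch_metrics = {"loss": 0.40, "wer": 0.20, "cer": 0.05}  # made-up values

if should_save(epoch_metrics, best):
    best = dict(epoch_metrics)
    # A real training loop would write the checkpoint to disk here.
```

Under this rule, an epoch that improves WER but not the loss does not overwrite the saved model.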
Results
The fine-tuned model was saved at epoch 5 with:
- WER: 16.11%
- CER: 3.28%
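For reference, WER and CER are both edit-distance ratios: the number of substitutions, insertions and deletions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The sketch below shows the standard definitions in plain Python. It is not the evaluation code behind the figures above, and it assumes a non-empty reference and no text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "uyu munsi nagize umunsi mwiza"
hypothesis = "uyu munsi nagize umunsi mubi"  # made-up recognition error
print(wer(reference, hypothesis), cer(reference, hypothesis))
```

One wrong word out of five gives a WER of 0.2, while the CER is much smaller because most characters still match. The same pattern explains the gap between the two reported figures.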
How to use
If you want to transcribe a mono-channel audio file (.wav) containing a single speaker, use the following code:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
import torch

model_name = "ionut-visan/whisper-large-v3-turbo_kinyarwanda500"

# Load processor and model
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def preprocess_audio(audio_path, processor):
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)
    inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
    return {key: val.to(device) for key, val in inputs.items()}

def transcribe(audio_path, model, processor):
    """Generate transcription."""
    inputs = preprocess_audio(audio_path, processor)
    with torch.no_grad():
        generated_ids = model.generate(inputs["input_features"])
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return transcription[0]

# Define audio path
audio_file = "audio.wav"
transcription = transcribe(audio_file, model, processor)
print("Transcription:", transcription)
Example of result:
Transcription: Uyu munsi nagize umunsi mwiza.
Use cases
The model can be used for:
- Advanced voice assistants
- Automatic transcription
- Live subtitling systems
- Voice recognition for call centers
- Voice commands for smart devices
- Voice analysis for security (biometric authentication)
- Dictation systems for writers and professionals
- Assistive technology for people with disabilities
Communication
For any questions regarding this model or to explore collaborations on ambitious AI/ML projects, please feel free to contact me at: