# whisper-medium-wolof-2-english
This model is a fine-tuned version of openai/whisper-medium on the bilalfaye/english-wolof-french-dataset. The model is designed to translate Wolof audio into English text. Since the base Whisper model does not natively support Wolof, this fine-tuned version bridges that gap.
It achieves the following results on the evaluation set:
- Loss: 1.7756
- BLEU: 25.3308
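For context, the BLEU figure is a corpus-level score over generated English translations, most likely computed with sacreBLEU or an equivalent metric. A minimal sketch using the `evaluate` library (the prediction/reference strings below are illustrative, not taken from the dataset):

```python
import evaluate

# Illustrative strings; in practice, predictions come from model.generate(...)
# decoded with the processor, and references are the English target sentences
predictions = ["the children are playing outside"]
references = [["the children are playing outside"]]

sacrebleu = evaluate.load("sacrebleu")
score = sacrebleu.compute(predictions=predictions, references=references)["score"]
print(score)  # same 0-100 scale as the 25.3308 reported above
```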
## Model Description
The model is based on OpenAI's Whisper architecture, fine-tuned to recognize and translate Wolof speech to English. It leverages the "medium" variant, offering a balance between accuracy and computational efficiency.
## Intended Uses & Limitations
**Intended uses:**
- Automatic transcription and translation of Wolof audio into English text.
- Assisting researchers and language learners working with Wolof audio content.
**Limitations:**
- May struggle with heavy accents or noisy environments.
- Performance may vary depending on speaker pronunciation and recording quality.
## Training and Evaluation Data
The model was fine-tuned on the bilalfaye/english-wolof-french-dataset, which consists of Wolof audio paired with English translations.
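A quick way to inspect that pairing without downloading the full dataset is to stream a single example. The `wo` field name matches the inference example later in this card; other column names are best checked from the printed keys:

```python
from datasets import load_dataset

# Stream one example to see which columns (audio and text fields) are available
ds = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
sample = next(iter(ds))
print(list(sample.keys()))  # column names
print(sample["wo"])         # Wolof transcript, as used in the inference example below
```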
## Training Procedure
### Training Hyperparameters
The following hyperparameters were used during training:
- Learning Rate: 1e-05
- Train Batch Size: 32
- Eval Batch Size: 16
- Seed: 42
- Optimizer: Adam (betas=(0.9,0.999), epsilon=1e-08)
- LR Scheduler Type: Linear
- Warmup Steps: 500
- Training Steps: 20000
- Mixed Precision Training: Native AMP
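These settings map directly onto `Seq2SeqTrainingArguments` from Transformers, the usual vehicle for Whisper fine-tuning. A minimal sketch, assuming a hypothetical output directory and an evaluation cadence inferred from the 2000-step rows in the results table below (the Adam betas and epsilon listed above are the Trainer defaults):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-wolof-2-english",  # hypothetical path
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20000,
    seed=42,
    fp16=True,                   # native AMP mixed precision
    eval_strategy="steps",
    eval_steps=2000,             # assumed; matches the evaluation rows below
    predict_with_generate=True,  # needed to compute BLEU during evaluation
)
```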
### Training Results
| Training Loss | Epoch | Step | Validation Loss | BLEU |
|---|---|---|---|---|
| 1.1851 | 0.8941 | 2000 | 1.1864 | 18.7395 |
| 0.8701 | 1.7881 | 4000 | 1.1268 | 22.3615 |
| 0.566 | 2.6822 | 6000 | 1.1656 | 24.4993 |
| 0.3238 | 3.5762 | 8000 | 1.2711 | 25.1466 |
| 0.1725 | 4.4703 | 10000 | 1.3854 | 24.7036 |
| 0.0821 | 5.3643 | 12000 | 1.4924 | 25.2531 |
| 0.0424 | 6.2584 | 14000 | 1.5961 | 24.4800 |
| 0.018 | 7.1524 | 16000 | 1.6757 | 24.8197 |
| 0.0101 | 8.0465 | 18000 | 1.7439 | 25.1500 |
| 0.0089 | 8.9405 | 20000 | 1.7756 | 25.3308 |
### Framework Versions
- Transformers: 4.41.2
- PyTorch: 2.4.0+cu121
- Datasets: 3.2.0
- Tokenizers: 0.19.1
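To reproduce this environment, the versions above can be pinned at install time. Note that the `+cu121` PyTorch build is CUDA-specific; the plain `torch==2.4.0` wheel is the closest PyPI equivalent:

```bash
pip install transformers==4.41.2 datasets==3.2.0 tokenizers==0.19.1 torch==2.4.0
```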
## Inference
### Using Python Code
```bash
pip install transformers datasets torch
```

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
# Load model and processor
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("bilalfaye/whisper-medium-wolof-2-english").to(device)
processor = WhisperProcessor.from_pretrained("bilalfaye/whisper-medium-wolof-2-english")
# Stream the dataset and take the third sample as a demo input
streaming_dataset = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
iterator = iter(streaming_dataset)
for _ in range(3):
    sample = next(iterator)
# Preprocess audio
input_features = processor(
    sample["wo_audio"]["audio"]["array"],
    sampling_rate=sample["wo_audio"]["audio"]["sampling_rate"],
    return_tensors="pt",
).input_features.to(device)
# Generate the English translation
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print("Wolof source:", sample["wo"])
print("English translation:", transcription[0])
```
### Using Gradio Interface
```bash
pip install gradio torchaudio
```

```python
import torch
import torchaudio
import numpy as np
import gradio as gr
from transformers import pipeline
# Load model pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(task="automatic-speech-recognition", model="bilalfaye/whisper-medium-wolof-2-english", device=device)
# Function for transcription
def transcribe(audio):
    if audio is None:
        return "No audio provided. Please try again."

    # Gradio passes a file path for uploads; microphone input may arrive as a (path, sample_rate) tuple
    if isinstance(audio, str):
        waveform, sample_rate = torchaudio.load(audio)
    elif isinstance(audio, tuple):
        waveform, sample_rate = torchaudio.load(audio[0])
    else:
        return "Invalid audio input format."

    # Downmix stereo to mono
    if waveform.shape[0] > 1:
        mono_audio = waveform.mean(dim=0, keepdim=True)
    else:
        mono_audio = waveform

    # Whisper expects 16 kHz audio
    target_sample_rate = 16000
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        mono_audio = resampler(mono_audio)
        sample_rate = target_sample_rate

    mono_audio = mono_audio.squeeze(0).numpy().astype(np.float32)
    result = pipe({"array": mono_audio, "sampling_rate": sample_rate})
    return result["text"]
# Create the Gradio interface
interface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),
    outputs="text",
    title="Whisper Medium Wolof Translation",
    description="Record audio in Wolof and translate it to English using a fine-tuned Whisper medium model.",
)

app = gr.TabbedInterface(
    [interface],
    ["Use Uploaded File or Microphone"]
)

app.launch(debug=True, share=True)
```
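With `share=True`, Gradio additionally exposes the app through a temporary public URL, which is convenient for quick demos from a notebook; `debug=True` keeps the cell attached so errors raised inside `transcribe` surface in the output.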
## Author
- Bilal FAYE