whisper-medium-wolof-2-english

This model is a fine-tuned version of openai/whisper-medium on the bilalfaye/english-wolof-french-dataset. The model is designed to translate Wolof audio into English text. Since the base Whisper model does not natively support Wolof, this fine-tuned version bridges that gap.

It achieves the following results on the evaluation set (a sketch of the BLEU computation follows the list):

  • Loss: 1.7756
  • BLEU: 25.3308
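
For context on the metric, a BLEU score like the one above can be computed with the evaluate library. The snippet below is a minimal sketch assuming the sacreBLEU implementation (the card does not state which BLEU variant was used); the sentences are illustrative only.

import evaluate

# Corpus-level BLEU on a 0-100 scale; sentences here are made up for illustration
bleu = evaluate.load("sacrebleu")
predictions = ["the market opens early in the morning"]    # hypothetical model output
references = [["the market opens early in the morning"]]   # hypothetical reference translation
result = bleu.compute(predictions=predictions, references=references)
print(result["score"])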

Model Description

The model is based on OpenAI's Whisper architecture, fine-tuned to recognize and translate Wolof speech to English. It leverages the "medium" variant, offering a balance between accuracy and computational efficiency.

Intended Uses & Limitations

Intended uses:

  • Automatic transcription and translation of Wolof audio into English text.
  • Assisting researchers and language learners working with Wolof audio content.

Limitations:

  • May struggle with heavy accents or noisy environments.
  • Performance may vary depending on speaker pronunciation and recording quality.

Training and Evaluation Data

The model was fine-tuned on the bilalfaye/english-wolof-french-dataset, which consists of Wolof audio paired with English translations.
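
To get a quick look at the data layout without downloading the whole corpus, the dataset can be streamed. This is a minimal sketch; the wo_audio and wo column names appear in the inference example below, while any other columns are whatever the dataset actually exposes.

from datasets import load_dataset

# Stream the dataset so nothing is downloaded up front
ds = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
sample = next(iter(ds))

print(sample.keys())                                  # available columns
print(sample["wo"])                                   # Wolof transcript
print(sample["wo_audio"]["audio"]["sampling_rate"])   # audio sampling rate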

Training Procedure

Training Hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • Learning Rate: 1e-05
  • Train Batch Size: 32
  • Eval Batch Size: 16
  • Seed: 42
  • Optimizer: Adam (betas=(0.9,0.999), epsilon=1e-08)
  • LR Scheduler Type: Linear
  • Warmup Steps: 500
  • Training Steps: 20000
  • Mixed Precision Training: Native AMP
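
The original training script is not published; as a hedged reconstruction, these settings map onto Transformers' Seq2SeqTrainingArguments roughly as follows. The output_dir and the evaluation/save cadence are assumptions, the cadence inferred from the 2000-step intervals in the results table below.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-wolof-2-english",  # assumed output path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20000,
    fp16=True,                    # "Native AMP" mixed precision
    evaluation_strategy="steps",  # assumed; matches the 2000-step eval cadence below
    eval_steps=2000,
    save_steps=2000,              # assumed checkpointing cadence
    predict_with_generate=True,   # required to compute BLEU during evaluation
)
# Adam with betas=(0.9, 0.999) and epsilon=1e-8 is the default optimizer here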

Training Results

Training Loss   Epoch    Step     Validation Loss   BLEU
1.1851          0.8941    2000    1.1864            18.7395
0.8701          1.7881    4000    1.1268            22.3615
0.5660          2.6822    6000    1.1656            24.4993
0.3238          3.5762    8000    1.2711            25.1466
0.1725          4.4703   10000    1.3854            24.7036
0.0821          5.3643   12000    1.4924            25.2531
0.0424          6.2584   14000    1.5961            24.4800
0.0180          7.1524   16000    1.6757            24.8197
0.0101          8.0465   18000    1.7439            25.1500
0.0089          8.9405   20000    1.7756            25.3308

Framework Versions

  • Transformers: 4.41.2
  • PyTorch: 2.4.0+cu121
  • Datasets: 3.2.0
  • Tokenizers: 0.19.1

Inference

Using Python Code

! pip install transformers datasets torch

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset

# Load model and processor
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("bilalfaye/whisper-medium-wolof-2-english").to(device)
processor = WhisperProcessor.from_pretrained("bilalfaye/whisper-medium-wolof-2-english")

# Load dataset
streaming_dataset = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
iterator = iter(streaming_dataset)
for _ in range(3):  # advance to the third sample
    sample = next(iterator)

# Preprocess audio
input_features = processor(sample["wo_audio"]["audio"]["array"],
                           sampling_rate=sample["wo_audio"]["audio"]["sampling_rate"],
                           return_tensors="pt").input_features.to(device)

# Generate transcription
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print("Correct sentence:", sample["wo"])
print("Transcription:", transcription[0])

Using Gradio Interface

! pip install gradio torchaudio

import torch        # needed for the CUDA availability check below
import torchaudio   # needed by transcribe() for loading and resampling audio
from transformers import pipeline
import gradio as gr
import numpy as np

# Load model pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(task="automatic-speech-recognition", model="bilalfaye/whisper-medium-wolof-2-english", device=device)

# Function for transcription
def transcribe(audio):
    if audio is None:
        return "No audio provided. Please try again."

    if isinstance(audio, str):  # file path (matches type="filepath" in the interface below)
        waveform, sample_rate = torchaudio.load(audio)
    elif isinstance(audio, tuple):  # defensive: some Gradio configurations pass a tuple for microphone input
        waveform, sample_rate = torchaudio.load(audio[0])
    else:
        return "Invalid audio input format."
    
    # Downmix multi-channel audio to mono
    if waveform.shape[0] > 1:
        mono_audio = waveform.mean(dim=0, keepdim=True)
    else:
        mono_audio = waveform

    # Whisper expects 16 kHz input; resample if necessary
    target_sample_rate = 16000
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        mono_audio = resampler(mono_audio)
        sample_rate = target_sample_rate

    # Convert to the float32 NumPy array expected by the pipeline
    mono_audio = mono_audio.squeeze(0).numpy().astype(np.float32)

    result = pipe({"array": mono_audio, "sampling_rate": sample_rate})
    return result['text']


# Create Gradio interfaces
interface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),  
    outputs="text",
    title="Whisper Medium Wolof Translation",
    description="Record audio in Wolof and translate it to English using a fine-tuned Whisper medium model.",
    #live=True,
)


app = gr.TabbedInterface(
    [interface],
    ["Use Uploaded File or Microphone"]  
)

app.launch(debug=True, share=True)
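
Note that with debug=True the launch call blocks the notebook cell while the app runs, and share=True additionally serves the interface through a temporary public gradio.live URL alongside the local address.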

Author

  • Bilal FAYE