whisper-medium-english-2-wolof

This model is a fine-tuned version of openai/whisper-medium on the bilalfaye/english-wolof-french-dataset. The model is designed to translate English audio into Wolof text. Since the base Whisper model does not natively support Wolof, this fine-tuned version bridges that gap. It achieves the following results on the evaluation set:

  • Loss: 1.1668
  • BLEU: 34.6061

Model Description

The model is based on OpenAI's Whisper architecture, fine-tuned to recognize and translate English speech to Wolof. It leverages the "medium" variant, offering a balance between accuracy and computational efficiency.

Intended Uses & Limitations

Intended uses:

  • Automatic transcription and translation of English audio into Wolof text.
  • Assisting researchers and language learners working with English audio content.

Limitations:

  • May struggle with heavy accents or noisy environments.
  • Performance may vary depending on speaker pronunciation and recording quality.

Training and Evaluation Data

The model was fine-tuned on the bilalfaye/english-wolof-french-dataset, which consists of English audio paired with Wolof translations.
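
For a quick look at the data, the dataset can be streamed without downloading it in full. A minimal sketch (the en and en_audio column names are the ones used by the inference example later in this card):

from datasets import load_dataset

# Stream one example to inspect the schema without downloading the full dataset
ds = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())                                 # all available columns
print(sample["en"])                                  # English transcript
print(sample["en_audio"]["audio"]["sampling_rate"])  # audio sampling rate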

Training Procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 32
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • training_steps: 20000
  • mixed_precision_training: Native AMP
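
These settings map directly onto Seq2SeqTrainingArguments from Transformers. A minimal, hypothetical sketch of an equivalent configuration (the output directory is illustrative; the Adam betas and epsilon listed above are the optimizer defaults):

from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the training configuration listed above
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-english-2-wolof",  # illustrative path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20000,
    fp16=True,  # native AMP mixed-precision training
)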

Training results

Training Loss   Epoch    Step    Validation Loss   BLEU
0.9771          0.8941   2000    0.9736            22.8506
0.6832          1.7881   4000    0.8379            30.0113
0.4568          2.6822   6000    0.8083            33.4759
0.2623          3.5762   8000    0.8506            33.4723
0.1608          4.4703   10000   0.9128            33.6342
0.0758          5.3643   12000   0.9808            33.7770
0.0315          6.2584   14000   1.0546            34.0842
0.0133          7.1524   16000   1.1085            34.2531
0.0057          8.0465   18000   1.1455            34.5325
0.0046          8.9405   20000   1.1668            34.6061
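
The BLEU column can be reproduced with sacreBLEU; a minimal sketch using the evaluate package (an assumption, since it is not in the framework list below) with placeholder strings:

import evaluate

bleu = evaluate.load("sacrebleu")
predictions = ["hypothetical model output"]     # decoded Wolof hypotheses
references = [["hypothetical reference text"]]  # one list of references per prediction
print(bleu.compute(predictions=predictions, references=references)["score"])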

Framework versions

  • Transformers 4.41.2
  • PyTorch 2.4.0+cu121
  • Datasets 3.2.0
  • Tokenizers 0.19.1

Inference

Using Python Code

! pip install transformers datasets torch

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset

# Load model and processor
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("bilalfaye/whisper-medium-english-2-wolof").to(device)
processor = WhisperProcessor.from_pretrained("bilalfaye/whisper-medium-english-2-wolof")

# Load dataset
streaming_dataset = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
iterator = iter(streaming_dataset)
# Advance the stream and take the third example
sample = next(iterator)
sample = next(iterator)
sample = next(iterator)


# Preprocess audio
input_features = processor(sample["en_audio"]["audio"]["array"],
                           sampling_rate=sample["en_audio"]["audio"]["sampling_rate"],
                           return_tensors="pt").input_features.to(device)

# Generate transcription
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print("Correct sentence:", sample["en"])
print("Transcription:", transcription[0])

Using Gradio Interface

! pip install gradio torchaudio

from transformers import pipeline
import gradio as gr
import numpy as np
import torch
import torchaudio


# Load model pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(task="automatic-speech-recognition", model="bilalfaye/whisper-medium-english-2-wolof", device=device)

# Function for transcription
def transcribe(audio):
    if audio is None:
        return "No audio provided. Please try again."

    if isinstance(audio, str):  # upload or microphone with type="filepath" returns a file path
        waveform, sample_rate = torchaudio.load(audio)
    elif isinstance(audio, tuple):  # microphone case (Gradio may return a (file, sample_rate) tuple)
        waveform, sample_rate = torchaudio.load(audio[0])
    else:
        return "Invalid audio input format."
    
    # Downmix multi-channel audio to mono
    if waveform.shape[0] > 1:
        mono_audio = waveform.mean(dim=0, keepdim=True)
    else:
        mono_audio = waveform

    # Whisper models expect 16 kHz input
    target_sample_rate = 16000
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        mono_audio = resampler(mono_audio)
        sample_rate = target_sample_rate

    mono_audio = mono_audio.squeeze(0).numpy().astype(np.float32)

    result = pipe({"array": mono_audio, "sampling_rate": sample_rate})
    return result['text']


# Create Gradio interfaces
interface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),  
    outputs="text",
    title="Whisper Medium English Translation",
    description="Record audio in English and translate it to Wolof using a fine-tuned Whisper medium model.",
    #live=True,
)


app = gr.TabbedInterface(
    [interface],
    ["Use Uploaded File or Microphone"]  
)

app.launch(debug=True, share=True)

Author

  • Bilal FAYE