whisper-medium-wolof-2-english

This model is a fine-tuned version of openai/whisper-medium on the bilalfaye/english-wolof-french-dataset. The model is designed to translate Wolof audio into English text. Since the base Whisper model does not natively support Wolof, this fine-tuned version bridges that gap.

It achieves the following results on the evaluation set (a sketch of the BLEU computation follows the list):

  • Loss: 1.7756
  • BLEU: 25.3308
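
For context on the metric, a BLEU score like the one above can be computed with the evaluate library. The snippet below is a minimal sketch assuming the sacreBLEU implementation (the card does not state which BLEU variant was used); the sentences are illustrative only.

import evaluate

# Corpus-level BLEU on a 0-100 scale; sentences here are made up for illustration
bleu = evaluate.load("sacrebleu")
predictions = ["the market opens early in the morning"]    # hypothetical model output
references = [["the market opens early in the morning"]]   # hypothetical reference translation
result = bleu.compute(predictions=predictions, references=references)
print(result["score"])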

Model Description

The model is based on OpenAI's Whisper architecture, fine-tuned to recognize and translate Wolof speech to English. It leverages the "medium" variant, offering a balance between accuracy and computational efficiency.

Intended Uses & Limitations

Intended uses:

  • Automatic transcription and translation of Wolof audio into English text.
  • Assisting researchers and language learners working with Wolof audio content.

Limitations:

  • May struggle with heavy accents or noisy environments.
  • Performance may vary depending on speaker pronunciation and recording quality.

Training and Evaluation Data

The model was fine-tuned on the bilalfaye/english-wolof-french-dataset, which consists of Wolof audio paired with English translations.
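
To get a quick look at the data layout without downloading the whole corpus, the dataset can be streamed. This is a minimal sketch; the wo_audio and wo column names appear in the inference example below, while any other columns are whatever the dataset actually exposes.

from datasets import load_dataset

# Stream the dataset so nothing is downloaded up front
ds = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
sample = next(iter(ds))

print(sample.keys())                                  # available columns
print(sample["wo"])                                   # Wolof transcript
print(sample["wo_audio"]["audio"]["sampling_rate"])   # audio sampling rate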

Training Procedure

Training Hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • Learning Rate: 1e-05
  • Train Batch Size: 32
  • Eval Batch Size: 16
  • Seed: 42
  • Optimizer: Adam (betas=(0.9,0.999), epsilon=1e-08)
  • LR Scheduler Type: Linear
  • Warmup Steps: 500
  • Training Steps: 20000
  • Mixed Precision Training: Native AMP
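
The original training script is not published; as a hedged reconstruction, these settings map onto Transformers' Seq2SeqTrainingArguments roughly as follows. The output_dir and the evaluation/save cadence are assumptions, the cadence inferred from the 2000-step intervals in the results table below.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-wolof-2-english",  # assumed output path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20000,
    fp16=True,                    # "Native AMP" mixed precision
    evaluation_strategy="steps",  # assumed; matches the 2000-step eval cadence below
    eval_steps=2000,
    save_steps=2000,              # assumed checkpointing cadence
    predict_with_generate=True,   # required to compute BLEU during evaluation
)
# Adam with betas=(0.9, 0.999) and epsilon=1e-8 is the default optimizer here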

Training Results

Training Loss   Epoch    Step     Validation Loss   BLEU
1.1851          0.8941    2000    1.1864            18.7395
0.8701          1.7881    4000    1.1268            22.3615
0.5660          2.6822    6000    1.1656            24.4993
0.3238          3.5762    8000    1.2711            25.1466
0.1725          4.4703   10000    1.3854            24.7036
0.0821          5.3643   12000    1.4924            25.2531
0.0424          6.2584   14000    1.5961            24.4800
0.0180          7.1524   16000    1.6757            24.8197
0.0101          8.0465   18000    1.7439            25.1500
0.0089          8.9405   20000    1.7756            25.3308

Framework Versions

  • Transformers: 4.41.2
  • PyTorch: 2.4.0+cu121
  • Datasets: 3.2.0
  • Tokenizers: 0.19.1

Inference

Using Python Code

! pip install transformers datasets torch

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset

# Load model and processor
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("bilalfaye/whisper-medium-wolof-2-english").to(device)
processor = WhisperProcessor.from_pretrained("bilalfaye/whisper-medium-wolof-2-english")

# Load dataset
streaming_dataset = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
iterator = iter(streaming_dataset)
for _ in range(3):  # advance to the third sample
    sample = next(iterator)

# Preprocess audio
input_features = processor(sample["wo_audio"]["audio"]["array"],
                           sampling_rate=sample["wo_audio"]["audio"]["sampling_rate"],
                           return_tensors="pt").input_features.to(device)

# Generate transcription
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print("Correct sentence:", sample["wo"])
print("Transcription:", transcription[0])

Using Gradio Interface

! pip install gradio torchaudio

import torch        # needed for the CUDA availability check below
import torchaudio   # needed by transcribe() for loading and resampling audio
from transformers import pipeline
import gradio as gr
import numpy as np

# Load model pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(task="automatic-speech-recognition", model="bilalfaye/whisper-medium-wolof-2-english", device=device)

# Function for transcription
def transcribe(audio):
    if audio is None:
        return "No audio provided. Please try again."

    if isinstance(audio, str):  # file path (matches type="filepath" in the interface below)
        waveform, sample_rate = torchaudio.load(audio)
    elif isinstance(audio, tuple):  # defensive: some Gradio configurations pass a tuple for microphone input
        waveform, sample_rate = torchaudio.load(audio[0])
    else:
        return "Invalid audio input format."
    
    # Downmix multi-channel audio to mono
    if waveform.shape[0] > 1:
        mono_audio = waveform.mean(dim=0, keepdim=True)
    else:
        mono_audio = waveform

    # Whisper expects 16 kHz input; resample if necessary
    target_sample_rate = 16000
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        mono_audio = resampler(mono_audio)
        sample_rate = target_sample_rate

    # Convert to the float32 NumPy array expected by the pipeline
    mono_audio = mono_audio.squeeze(0).numpy().astype(np.float32)

    result = pipe({"array": mono_audio, "sampling_rate": sample_rate})
    return result['text']


# Create Gradio interfaces
interface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),  
    outputs="text",
    title="Whisper Medium Wolof Translation",
    description="Record audio in Wolof and translate it to English using a fine-tuned Whisper medium model.",
    #live=True,
)


app = gr.TabbedInterface(
    [interface],
    ["Use Uploaded File or Microphone"]  
)

app.launch(debug=True, share=True)
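
Note that with debug=True the launch call blocks the notebook cell while the app runs, and share=True additionally serves the interface through a temporary public gradio.live URL alongside the local address.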

Author

  • Bilal FAYE