whisper-medium-english-2-wolof

This model is a fine-tuned version of openai/whisper-medium on the bilalfaye/english-wolof-french-dataset. The model is designed to translate English audio into Wolof text. Since the base Whisper model does not natively support Wolof, this fine-tuned version bridges that gap. It achieves the following results on the evaluation set:

  • Loss: 1.1668
  • BLEU: 34.6061

Model Description

The model is based on OpenAI's Whisper architecture, fine-tuned to recognize and translate English speech to Wolof. It leverages the "medium" variant, offering a balance between accuracy and computational efficiency.

Intended Uses & Limitations

Intended uses:

  • Automatic transcription and translation of English audio into Wolof text.
  • Assisting researchers and language learners working with English audio content.

Limitations:

  • May struggle with heavy accents or noisy environments.
  • Performance may vary depending on speaker pronunciation and recording quality.

Training and Evaluation Data

The model was fine-tuned on the bilalfaye/english-wolof-french-dataset, which consists of English audio paired with Wolof translations.
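
For a quick look at the data, the dataset can be streamed without downloading it in full. A minimal sketch (the en and en_audio column names are the ones used by the inference example later in this card):

from datasets import load_dataset

# Stream one example to inspect the schema without downloading the full dataset
ds = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())                                 # all available columns
print(sample["en"])                                  # English transcript
print(sample["en_audio"]["audio"]["sampling_rate"])  # audio sampling rate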

Training Procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 32
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • training_steps: 20000
  • mixed_precision_training: Native AMP
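
These settings map directly onto Seq2SeqTrainingArguments from Transformers. A minimal, hypothetical sketch of an equivalent configuration (the output directory is illustrative; the Adam betas and epsilon listed above are the optimizer defaults):

from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the training configuration listed above
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-english-2-wolof",  # illustrative path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20000,
    fp16=True,  # native AMP mixed-precision training
)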

Training results

Training Loss   Epoch    Step    Validation Loss   BLEU
0.9771          0.8941   2000    0.9736            22.8506
0.6832          1.7881   4000    0.8379            30.0113
0.4568          2.6822   6000    0.8083            33.4759
0.2623          3.5762   8000    0.8506            33.4723
0.1608          4.4703   10000   0.9128            33.6342
0.0758          5.3643   12000   0.9808            33.7770
0.0315          6.2584   14000   1.0546            34.0842
0.0133          7.1524   16000   1.1085            34.2531
0.0057          8.0465   18000   1.1455            34.5325
0.0046          8.9405   20000   1.1668            34.6061
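
The BLEU column can be reproduced with sacreBLEU; a minimal sketch using the evaluate package (an assumption, since it is not in the framework list below) with placeholder strings:

import evaluate

bleu = evaluate.load("sacrebleu")
predictions = ["hypothetical model output"]     # decoded Wolof hypotheses
references = [["hypothetical reference text"]]  # one list of references per prediction
print(bleu.compute(predictions=predictions, references=references)["score"])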

Framework versions

  • Transformers 4.41.2
  • PyTorch 2.4.0+cu121
  • Datasets 3.2.0
  • Tokenizers 0.19.1

Inference

Using Python Code

! pip install transformers datasets torch

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset

# Load model and processor
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("bilalfaye/whisper-medium-english-2-wolof").to(device)
processor = WhisperProcessor.from_pretrained("bilalfaye/whisper-medium-english-2-wolof")

# Load dataset
streaming_dataset = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
iterator = iter(streaming_dataset)
# Advance the stream and take the third example
sample = next(iterator)
sample = next(iterator)
sample = next(iterator)


# Preprocess audio
input_features = processor(sample["en_audio"]["audio"]["array"],
                           sampling_rate=sample["en_audio"]["audio"]["sampling_rate"],
                           return_tensors="pt").input_features.to(device)

# Generate transcription
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print("Correct sentence:", sample["en"])
print("Transcription:", transcription[0])

Using Gradio Interface

! pip install gradio torchaudio

from transformers import pipeline
import gradio as gr
import numpy as np
import torch
import torchaudio


# Load model pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(task="automatic-speech-recognition", model="bilalfaye/whisper-medium-english-2-wolof", device=device)

# Function for transcription
def transcribe(audio):
    if audio is None:
        return "No audio provided. Please try again."

    if isinstance(audio, str):  # upload or microphone with type="filepath" returns a file path
        waveform, sample_rate = torchaudio.load(audio)
    elif isinstance(audio, tuple):  # microphone case (Gradio may return a (file, sample_rate) tuple)
        waveform, sample_rate = torchaudio.load(audio[0])
    else:
        return "Invalid audio input format."
    
    # Downmix multi-channel audio to mono
    if waveform.shape[0] > 1:
        mono_audio = waveform.mean(dim=0, keepdim=True)
    else:
        mono_audio = waveform

    # Whisper models expect 16 kHz input
    target_sample_rate = 16000
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        mono_audio = resampler(mono_audio)
        sample_rate = target_sample_rate

    mono_audio = mono_audio.squeeze(0).numpy().astype(np.float32)

    result = pipe({"array": mono_audio, "sampling_rate": sample_rate})
    return result['text']


# Create Gradio interfaces
interface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),  
    outputs="text",
    title="Whisper Medium English Translation",
    description="Record audio in English and translate it to Wolof using a fine-tuned Whisper medium model.",
    #live=True,
)


app = gr.TabbedInterface(
    [interface],
    ["Use Uploaded File or Microphone"]  
)

app.launch(debug=True, share=True)

Author

  • Bilal FAYE