---
license: apache-2.0
base_model: openai/whisper-medium
tags:
- generated_from_trainer
metrics:
- bleu
model-index:
- name: whisper-medium-english-2-wolof
  results: []
datasets:
- bilalfaye/english-wolof-french-dataset
language:
- en
- wo
pipeline_tag: automatic-speech-recognition
---

# whisper-medium-english-2-wolof

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the [bilalfaye/english-wolof-french-dataset](https://huggingface.co/datasets/bilalfaye/english-wolof-french-dataset). The model is designed to translate English audio into Wolof text. Since the base Whisper model does not natively support Wolof, this fine-tuned version bridges that gap.

It achieves the following results on the evaluation set:
- Loss: 1.1668
- BLEU: 34.6061

## Model Description

The model is based on OpenAI's Whisper architecture, fine-tuned to recognize English speech and translate it into Wolof. It uses the "medium" variant, which offers a balance between accuracy and computational efficiency.

## Intended Uses & Limitations

**Intended uses:**
- Automatic transcription and translation of English audio into Wolof text.
- Assisting researchers and language learners working with English audio content.

**Limitations:**
- May struggle with heavy accents or noisy environments.
- Performance may vary depending on speaker pronunciation and recording quality.

## Training and Evaluation Data

The model was fine-tuned on the [bilalfaye/english-wolof-french-dataset](https://huggingface.co/datasets/bilalfaye/english-wolof-french-dataset), which consists of English audio paired with Wolof translations.

## Training Procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 32
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 20000
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch  | Step  | Validation Loss | BLEU    |
|:-------------:|:------:|:-----:|:---------------:|:-------:|
| 0.9771        | 0.8941 | 2000  | 0.9736          | 22.8506 |
| 0.6832        | 1.7881 | 4000  | 0.8379          | 30.0113 |
| 0.4568        | 2.6822 | 6000  | 0.8083          | 33.4759 |
| 0.2623        | 3.5762 | 8000  | 0.8506          | 33.4723 |
| 0.1608        | 4.4703 | 10000 | 0.9128          | 33.6342 |
| 0.0758        | 5.3643 | 12000 | 0.9808          | 33.7770 |
| 0.0315        | 6.2584 | 14000 | 1.0546          | 34.0842 |
| 0.0133        | 7.1524 | 16000 | 1.1085          | 34.2531 |
| 0.0057        | 8.0465 | 18000 | 1.1455          | 34.5325 |
| 0.0046        | 8.9405 | 20000 | 1.1668          | 34.6061 |

### Framework versions

- Transformers 4.41.2
- PyTorch 2.4.0+cu121
- Datasets 3.2.0
- Tokenizers 0.19.1
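For orientation, the hyperparameters listed above correspond approximately to the `Seq2SeqTrainingArguments` configuration sketched below. This is a minimal reconstruction, not the script actually used to train this model: `output_dir` is a placeholder, and the 2000-step evaluation cadence is inferred from the results table.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: reconstructs the listed hyperparameters, not the original script.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-english-2-wolof",  # placeholder, assumed
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    adam_beta1=0.9,                  # Adam betas/epsilon as listed above
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20000,
    fp16=True,                       # "Native AMP" mixed precision
    evaluation_strategy="steps",     # assumed: eval every 2000 steps per the results table
    eval_steps=2000,
    predict_with_generate=True,      # required to compute BLEU on generated text
)
```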
## Inference

### Using Python Code

```python
!pip install transformers datasets torch

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset

# Load model and processor
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained("bilalfaye/whisper-medium-english-2-wolof").to(device)
processor = WhisperProcessor.from_pretrained("bilalfaye/whisper-medium-english-2-wolof")

# Stream the dataset and take the third sample as an example
streaming_dataset = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)
iterator = iter(streaming_dataset)
sample = next(iterator)
sample = next(iterator)
sample = next(iterator)

# Preprocess the English audio into input features
input_features = processor(
    sample["en_audio"]["audio"]["array"],
    sampling_rate=sample["en_audio"]["audio"]["sampling_rate"],
    return_tensors="pt",
).input_features.to(device)

# Generate the Wolof translation
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print("English source:", sample["en"])
print("Wolof translation:", transcription[0])
```

### Using Gradio Interface

```python
!pip install gradio torchaudio

import torch
import torchaudio
import numpy as np
import gradio as gr
from transformers import pipeline

# Load model pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    task="automatic-speech-recognition",
    model="bilalfaye/whisper-medium-english-2-wolof",
    device=device,
)

# Function for transcription
def transcribe(audio):
    if audio is None:
        return "No audio provided. Please try again."

    if isinstance(audio, str):
        # Uploaded file: Gradio passes a file path
        waveform, sample_rate = torchaudio.load(audio)
    elif isinstance(audio, tuple):
        # Microphone case (Gradio returns a (file, sample_rate) tuple)
        waveform, sample_rate = torchaudio.load(audio[0])
    else:
        return "Invalid audio input format."

    # Downmix to mono if the recording has multiple channels
    if waveform.shape[0] > 1:
        mono_audio = waveform.mean(dim=0, keepdim=True)
    else:
        mono_audio = waveform

    # Resample to the 16 kHz rate Whisper expects
    target_sample_rate = 16000
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        mono_audio = resampler(mono_audio)
        sample_rate = target_sample_rate

    mono_audio = mono_audio.squeeze(0).numpy().astype(np.float32)

    result = pipe({"array": mono_audio, "sampling_rate": sample_rate})
    return result["text"]

# Create the Gradio interface
interface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),
    outputs="text",
    title="Whisper Medium English-to-Wolof Translation",
    description="Record or upload English audio and translate it to Wolof using a fine-tuned Whisper medium model.",
)

app = gr.TabbedInterface(
    [interface],
    ["Use Uploaded File or Microphone"]
)

app.launch(debug=True, share=True)
```

**Author**
- Bilal FAYE