wav2vec2-large-mms-1b-wolof

This model is a fine-tuned version of facebook/mms-1b-all on the Isma/alffa_wolof dataset. It is designed to perform automatic speech recognition (ASR) in the Wolof language.

Model description

This model is based on the Wav2Vec 2.0 architecture, fine-tuned for speech recognition. The base model, facebook/mms-1b-all, was trained on a large multilingual corpus for general-purpose ASR. This version has been fine-tuned on the Isma/alffa_wolof dataset, which contains audio recordings in the Wolof language.

Training and evaluation data

The model was trained on the Isma/alffa_wolof dataset, which contains audio samples in the Wolof language. Fine-tuning on this data adapts the model to the specific phonetic characteristics of Wolof speech.
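
The dataset can be loaded from the Hub for a quick look at the training material. This is a minimal sketch; the exact split and column names may differ from what is shown here.

from datasets import load_dataset

# Load the Wolof corpus used for fine-tuning
# (assumes the dataset is accessible under this id on the Hugging Face Hub)
alffa_wolof = load_dataset("Isma/alffa_wolof")
print(alffa_wolof)  # show the available splits and columns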

Manual inference

!pip install datasets transformers torch

# Load test dataset
from datasets import load_dataset

dataset = load_dataset("perrynelson/waxal-wolof", trust_remote_code=True)
dataset

# Listen to a sample (index 322) using IPython
from IPython.display import Audio

Audio(dataset['train'][322]['audio']['array'], rate=16000)

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch

model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Use half precision on GPU, full precision on CPU
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model with the Wolof adapter and move it to the selected device
model = Wav2Vec2ForCTC.from_pretrained(model_id,
                                       target_lang="wol",
                                       torch_dtype=torch_dtype
                                       ).to(device)


processor = Wav2Vec2Processor.from_pretrained(model_id)
processor.tokenizer.set_target_lang("wol")


# Process the audio
input_dict = processor(
    dataset['train'][322]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt",
    padding=True
)

# Move inputs to the same device and dtype as the model
input_values = input_dict.input_values.to(device, dtype=torch_dtype)

# Perform inference without tracking gradients
with torch.no_grad():
    logits = model(input_values).logits

# Decode predictions
pred_ids = torch.argmax(logits, dim=-1)[0]

print("Prediction:")
print(processor.decode(pred_ids))

print("\nReference:")
print(dataset['train'][322]['transcription'].lower())

Inference with pipeline

from transformers import pipeline
import torch

# Model ID
model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"

# Determine device (use GPU if available, otherwise fallback to CPU)
device = 0 if torch.cuda.is_available() else -1

# Use half precision (float16) for inference if GPU is available
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Set up the automatic speech recognition pipeline
# (the tokenizer and feature extractor are loaded automatically from the model id)
pipe = pipeline(
    task="automatic-speech-recognition",
    model=model_id,
    device=device,            # GPU index if available, otherwise CPU (-1)
    torch_dtype=torch_dtype,  # float16 on GPU, float32 on CPU
    framework="pt"            # use PyTorch
)

# Input audio processing
audio_array = dataset['train'][322]["audio"]["array"]  # Fetching an audio sample

# Run inference
result = pipe(audio_array)

# Prediction
print("Prediction:")
print(result['text'])

# Reference (for comparison)
print("\nReference:")
print(dataset['train'][322]['transcription'].lower())

Free memory

import gc
import torch
import psutil

# Free up unused memory in CUDA (GPU) - only needed if you use a GPU
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Clears GPU memory cache
    torch.cuda.reset_peak_memory_stats()  # Resets memory stats

# Collect any unused memory in Python (CPU)
gc.collect()  # Collect unused memory in Python's garbage collector

# Optionally, check memory status after clearing
if torch.cuda.is_available():
    print(f"GPU Memory Allocated: {torch.cuda.memory_allocated()} bytes")
    print(f"GPU Memory Cached: {torch.cuda.memory_reserved()} bytes")
else:
    print(f"CPU Memory Usage: {psutil.virtual_memory().percent}%")

Training hyperparameters

The following hyperparameters were used during training; a sketch of how they could map onto a Trainer configuration is shown after the list:

  • learning_rate: 0.0001
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 20
  • mixed_precision_training: Native AMP
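
A minimal sketch of how these hyperparameters could translate into a TrainingArguments object, assuming a standard Trainer-based fine-tuning script (the actual training code is not part of this card):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-large-mms-1b-wolof",  # hypothetical output path
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=20,
    fp16=True,                    # native AMP mixed-precision training
    evaluation_strategy="epoch",  # assumption: evaluate once per epoch, matching the results table
)
# The listed Adam betas (0.9, 0.999) and epsilon 1e-08 match the Trainer's
# default optimizer settings, so no extra arguments are needed for them.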

Training results

Training Loss   Epoch   Step    Validation Loss   WER
0.3793          14.0    12250   0.1517            0.1888
0.3709          15.0    13125   0.1512            0.1882
0.3702          16.0    14000   0.1499            0.1858
0.3670          17.0    14875   0.1492            0.1848
0.3656          18.0    15750   0.1493            0.1842
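
For reference, the word error rate (WER) metric used above can be computed on a few samples with the evaluate library. This sketch reuses the pipe and dataset objects from the inference sections and assumes evaluate and jiwer are installed (pip install evaluate jiwer).

import evaluate

wer_metric = evaluate.load("wer")

# Transcribe a handful of samples and compare against the reference transcriptions
samples = dataset["train"].select(range(5))
predictions = [pipe(s["audio"]["array"])["text"] for s in samples]
references = [s["transcription"].lower() for s in samples]

print("WER:", wer_metric.compute(predictions=predictions, references=references))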

Framework versions

  • Transformers 4.41.2
  • PyTorch 2.4.0+cu121
  • Datasets 3.2.0
  • Tokenizers 0.19.1

Intended uses & limitations

  • Intended uses: This model is intended for speech-to-text tasks in Wolof. It can be used to transcribe Wolof audio recordings into written text.
  • Limitations: The model performs best on clean audio and may struggle with noisy or low-quality recordings; input audio should also be resampled to 16 kHz, as in the sketch below. It is designed specifically for the Wolof language and may not work well with other languages.
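
The model expects 16 kHz input, as in the examples above, so recordings at other sampling rates should be resampled before transcription. A minimal sketch using torchaudio, with "my_recording.wav" as a placeholder path:

import torchaudio

# Load an arbitrary recording (placeholder path) and convert it to 16 kHz mono
waveform, sr = torchaudio.load("my_recording.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16_000)
waveform = waveform.mean(dim=0)  # downmix to mono if the file is stereo

# Transcribe with the pipeline defined earlier
print(pipe(waveform.numpy())["text"])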

Author Information

  • Author: Bilal FAYE