Whisper Large v3 Turbo (Kinyarwanda)

Whisper is an automatic speech recognition (ASR) system developed by OpenAI. It transcribes speech in many languages and can translate it into English text, and it is robust to accents and background noise. It is designed for general-purpose speech processing across a wide range of audio inputs.

Whisper-large-v3-turbo is an optimized version of OpenAI's Whisper-large-v3 model, designed to enhance transcription speed while maintaining high accuracy. This optimization is achieved by reducing the number of decoder layers from 32 to 4, resulting in a model that is significantly faster with only a minor decrease in transcription quality.

Fine-tune

I fine-tuned the Whisper-large-v3-turbo model on the Kinyarwanda ASR Track A dataset, which consists of over 90,000 audio files ranging from 10 to 40 seconds, each accompanied by its text transcription.

Before fine-tuning, the recordings, originally encoded with the Opus codec and stored in WebM (Matroska) containers at a 48,000 Hz sample rate, were converted to .wav files at a 16,000 Hz sample rate to match the model's input requirements.
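
The conversion step described above could be scripted roughly as follows. This is a minimal sketch, not the exact pipeline used for this model: it assumes the ffmpeg command-line tool is installed, that a mono downmix is wanted, and the names build_ffmpeg_command, convert_directory, webm_dir and wav_dir are hypothetical.

```python
import subprocess
from pathlib import Path

TARGET_SAMPLE_RATE = 16000  # Hz, the rate the Whisper feature extractor expects


def build_ffmpeg_command(src, dst, sample_rate=TARGET_SAMPLE_RATE):
    """Return the ffmpeg argument list that converts src to a mono WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", str(src),           # input file (Opus audio in a WebM container)
        "-ac", "1",               # assumption: downmix to a single channel
        "-ar", str(sample_rate),  # resample, e.g. 48,000 Hz -> 16,000 Hz
        str(dst),
    ]


def convert_directory(webm_dir, wav_dir):
    """Convert every .webm file in webm_dir to a .wav file in wav_dir."""
    wav_dir = Path(wav_dir)
    wav_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(webm_dir).glob("*.webm")):
        dst = wav_dir / (src.stem + ".wav")
        # check=True raises CalledProcessError if ffmpeg fails on a file
        subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Calling convert_directory("recordings", "recordings_16k") would then write one 16,000 Hz .wav file per input recording; both directory names are placeholders.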

Configuration

Trainable layers = encoder - 15 (progressively unfrozen 2 layers every 2 epochs), decoder - 4

Learning rate = 7e-6

Batch size = 2 (for both dataloaders)

Gradient accumulation steps = 8

Optimizer = AdamW

Weight decay = 0.1

Epochs = 10

Scheduler = Linear (with warmup = 0.05)

Dropout:

Encoder = 0.3 if idx == 20 else 0.2 if idx in [21, 22, 29, 30] else 0.0

Decoder = 0.3 if idx == 1 else 0.1

Early Stopping: patience=3, min_delta=0.0005
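
To make the batch and learning-rate settings concrete, the sketch below computes the effective batch size and the learning rate at a given optimizer step. It assumes that "warmup = 0.05" means 5% of the total optimizer steps and that the decay runs linearly to zero, which is the shape produced by get_linear_schedule_with_warmup in transformers; the function name linear_schedule_lr and the run length used in the example are hypothetical.

```python
def linear_schedule_lr(step, total_steps, base_lr=7e-6, warmup_ratio=0.05):
    """Learning rate at a 0-based optimizer step: linear warmup, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


batch_size = 2
grad_accum_steps = 8
effective_batch_size = batch_size * grad_accum_steps  # 16 samples per optimizer update

# Hypothetical run length: exactly 90,000 training clips for 10 epochs.
steps_per_epoch = 90_000 // effective_batch_size  # 5625
total_steps = steps_per_epoch * 10                # 56250

for step in (0, total_steps // 20, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, total_steps))
```

With gradient accumulation, the optimizer and the scheduler advance once every 8 mini-batches, so the schedule is indexed by optimizer steps rather than by mini-batches.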

The condition for saving the model is that the test loss, Word Error Rate (WER), and Character Error Rate (CER) must be lower than the previously recorded best values.
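
The early-stopping rule and the save condition can be sketched as below. This is an illustration under assumptions, not the training script itself: the card does not say which metric early stopping monitors (the sketch takes any single value, such as the test loss), and the names EarlyStopping and should_save are hypothetical.

```python
class EarlyStopping:
    """Stop when the monitored value has not improved by min_delta for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0005):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, value):
        """Record one epoch's value; return True when training should stop."""
        if value < self.best - self.min_delta:
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def should_save(current, best):
    """True only if loss, WER and CER are all strictly lower than the best so far."""
    return all(current[key] < best[key] for key in ("loss", "wer", "cer"))
```

Requiring all three metrics to improve together is stricter than tracking the loss alone: an epoch that lowers the loss but leaves WER unchanged does not overwrite the saved checkpoint.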

Results

[Figures: Error Rates Plot, Loss Plot, Learning Rate Plot, Fine-tuning Metrics]

The fine-tuned model was saved at epoch 5 with:

WER: 16.11%

CER: 3.28%
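
For readers unfamiliar with these metrics, the sketch below shows the standard definitions: WER is the word-level edit distance divided by the number of reference words, and CER is the same ratio at character level. It is a plain-Python illustration that assumes a non-empty reference and applies no text normalization; the card does not state which library or normalization produced the figures above, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "uyu munsi nagize umunsi mwiza"
hypothesis = "uyu munsi nagize umusi mwiza"  # one character dropped in one word
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

A single wrong character makes the whole word count as an error, which is why WER is typically much higher than CER on the same output.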

How to use

If you want to transcribe a mono-channel audio file (.wav) containing a single speaker, use the following code:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
import torch

model_name = "ionut-visan/whisper-large-v3-turbo_kinyarwanda500"

# Load processor and model
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

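def preprocess_audio(audio_path, processor):
    """Load an audio file, resample it to 16 kHz, and extract Whisper input features."""
    waveform, sample_rate = torchaudio.load(audio_path)  # shape: (channels, samples)

    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)

    # Average over the channel axis so the processor always receives a 1-D signal
    # (this leaves mono input unchanged).
    audio = waveform.mean(dim=0).numpy()
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    return {key: val.to(device) for key, val in inputs.items()}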

def transcribe(audio_path, model, processor):
    """Generate transcription."""
    inputs = preprocess_audio(audio_path, processor)

    with torch.no_grad():
        generated_ids = model.generate(inputs["input_features"])

    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return transcription[0]

# Define audio path
audio_file = "audio.wav"
transcription = transcribe(audio_file, model, processor)
print("Transcription:", transcription)

Example of result:

Transcription: Uyu munsi nagize umunsi mwiza.

Usage

The model can be used for:

Advanced voice assistants

Automatic transcription

Live subtitling systems

Voice recognition for call centers

Voice commands for smart devices

Voice analysis for security (biometric authentication)

Dictation systems for writers and professionals

Assistive technology for people with disabilities

Communication

For any questions regarding this model or to explore collaborations on ambitious AI/ML projects, please feel free to contact me at:

[email protected]

Ionuț Vișan's LinkedIn
