Whisper-medium Singlish2English translation model

Model overview

This model is a fine-tuned version of openai/whisper-medium, trained on over 20,000 Singlish-English text pairs derived from health coaching (HC) sessions conducted in Singapore.


Custom dataset overview

To enable fine-tuning of open-source foundation ASR models, we curated the HC dataset:

  • First, we collected audio recordings from nearly 90 HC sessions involving three health coaches (Singaporean, Malaysian, and European) and 40 Singaporean patients who were not adherent to their cholesterol-lowering medication. These patients were recruited from three polyclinics in Singapore and were compensated for the time they spent participating in our study.
  • Second, we used the fine-tuned ivabojic/whisper-medium-sing2eng-transcribe model to generate audio transcriptions.
  • Third, we employed GPT-4o mini to generate Singlish-to-English translation text pairs for these audio transcriptions.

The initial HC training dataset comprised GPT-generated translations for 5,000 original audio segments, each longer than 2 seconds. This dataset was then expanded by applying three additional rephrasing prompts to each original transcript, yielding four translations per segment and increasing the total number of samples to 20,000. The HC validation dataset consists of 2,000 samples, each generated using a single rephrasing prompt.
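
For illustration, the snippet below sketches how a Singlish transcript could be translated and rephrased with GPT-4o mini via the OpenAI Python client. The prompt wording and the example transcript are hypothetical; the exact prompts used to build HCtrain and HCvalid are not reproduced here.

from openai import OpenAI

# Assumes an OPENAI_API_KEY environment variable; the prompts below are
# hypothetical examples, not the exact prompts used to build the datasets.
client = OpenAI()

def translate_singlish(transcript: str, instruction: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

transcript = "Wah, the queue at the polyclinic damn long leh."
prompts = [
    "Translate this Singlish transcript into standard English.",
    "Translate this Singlish transcript into standard English and rephrase it formally.",
]

# One translation per prompt, mirroring the multi-prompt expansion described above
for prompt in prompts:
    print(translate_singlish(transcript, prompt))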

Table 1: Overview of the custom-created translation datasets.

Name    | Samples | Total hours | Avg. duration (s) | Min duration (s) | Max duration (s)
HCtrain | 20,000  | 94.9        | 17.1              | 2.0              | 378.4
HCvalid | 2,000   | 9.2         | 16.6              | 2.0              | 463.0
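
As a sanity check, the duration statistics in Table 1 can be recomputed from the audio segments themselves. A minimal sketch using torchaudio is shown below; the audios/ directory name is an assumption.

import glob
import torchaudio

# Hypothetical directory containing the segmented audio files
durations = []
for path in glob.glob("audios/*.wav"):
    info = torchaudio.info(path)
    durations.append(info.num_frames / info.sample_rate)

print(f"Samples: {len(durations)}")
print(f"Total hours: {sum(durations) / 3600:.1f}")
print(f"Avg duration (s): {sum(durations) / len(durations):.1f}")
print(f"Min (s): {min(durations):.1f}, Max (s): {max(durations):.1f}")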

Evaluation

Evaluation was conducted on the NSCP16_conv dataset, which contains 6,000 Singlish-to-English translation text pairs. Performance was measured using BLEU, comparing the fine-tuned model against the off-the-shelf Whisper-medium baseline.

NSCP16_conv is a bespoke dataset constructed from the Singapore National Speech Corpus (NSC). It is designed to capture the range and richness of Singlish conversational contexts.

  • Conversational and expressive speech includes:
    • Part 3: Natural dialogues on everyday topics between Singaporean speakers.
    • Part 5: Stylized recordings simulating debates, finance-related discussions, and emotional expressions (both positive and negative).
    • Part 6: Scenario-based dialogues, where speakers engage in topic-driven, semi-scripted interactions covering various themes.

Together, these components make NSCP16_conv a robust test set for assessing how well speech models generalize across local speech styles, tones, and speaking conditions.

Table 2: Evaluation results on the test dataset using BLEU. A higher BLEU indicates better performance (↑).

Model                                | BLEU (↑)
Whisper-medium (off-the-shelf)       | 34.45
Whisper-medium-Sing2Eng (fine-tuned) | 40.18

This represents a 5.73 absolute BLEU gain and a 16.6% relative improvement in BLEU over the baseline Whisper-medium model on the NSCP16_conv test set.
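
The relative figure follows from the absolute gain: 5.73 / 34.45 ≈ 16.6%. To reproduce a comparable corpus-level score, one option is sacreBLEU; the card does not state which BLEU implementation was used, so treat this as an assumption, and the sentences below are placeholders.

import sacrebleu

# Placeholder data: model outputs and their gold English references
hypotheses = ["the queue at the polyclinic was very long"]
references = [["the queue at the polyclinic was extremely long"]]  # one reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")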

By learning from diverse local accents and speaking styles, this model significantly improves translation accuracy for Singaporean speech, making it suitable for both research and production applications in multilingual and code-switched environments.

Usage

import torchaudio, torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

task = 'translate'
model_name = 'ivabojic/whisper-medium-sing2eng-translate'
audio_path = 'path_to_audio'  # e.g., https://github.com/IvaBojic/Singlish2English/blob/main/small_dataset/audios/00862042_713.wav

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name, task=task)

# Load and resample audio if needed
audio, sr = torchaudio.load(audio_path)
if sr != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
    audio = resampler(audio)
audio = audio.squeeze().numpy()

# Preprocess and generate translation
inputs = processor(audio=audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

translation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(translation)
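
The same checkpoint should also work through the transformers pipeline API, which bundles feature extraction, generation, and decoding into one call. This is a minimal sketch; chunk_length_s is an assumption for handling long recordings, not a value prescribed by this card.

from transformers import pipeline

# Wraps the model and processor; long audio is split into 30 s windows (assumption)
translator = pipeline(
    "automatic-speech-recognition",
    model="ivabojic/whisper-medium-sing2eng-translate",
    chunk_length_s=30,
)

result = translator("path_to_audio")  # accepts a file path or a 16 kHz NumPy array
print(result["text"])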

Project repository

For training scripts, evaluation tools, sample audio files, and more, visit the GitHub repository: https://github.com/IvaBojic/Singlish2English
