Whisper-medium Singlish2English translation model

Model overview

This model is a fine-tuned version of openai/whisper-medium, trained on over 20,000 Singlish-English text pairs derived from health coaching (HC) sessions conducted in Singapore.


Custom dataset overview

To enable fine-tuning of open-source foundation ASR models, we curated the HC dataset:

  • First, we collected audio recordings from nearly 90 HC sessions involving three health coaches (Singaporean, Malaysian, and European) and 40 Singaporean patients who were not adherent to their cholesterol-lowering medication. These patients were recruited from three polyclinics in Singapore and were compensated for the time they spent participating in our study.
  • Second, we used the fine-tuned ivabojic/whisper-medium-sing2eng-transcribe model to generate audio transcriptions.
  • Third, we employed GPT-4o mini to generate Singlish-to-English translation text pairs for these audio transcriptions.

The initial HC training dataset comprised GPT-generated translations for 5,000 original audio segments, each longer than 2 seconds. This dataset was then expanded by applying three additional rephrasing prompts to each original transcript, yielding four translations per segment and increasing the total number of samples to 20,000. The HC validation dataset consists of 2,000 samples, each generated using a single rephrasing prompt.
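
For illustration, the snippet below sketches how a Singlish transcript could be translated and rephrased with GPT-4o mini via the OpenAI Python client. The prompt wording and the example transcript are hypothetical; the exact prompts used to build HCtrain and HCvalid are not reproduced here.

from openai import OpenAI

# Assumes an OPENAI_API_KEY environment variable; the prompts below are
# hypothetical examples, not the exact prompts used to build the datasets.
client = OpenAI()

def translate_singlish(transcript: str, instruction: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

transcript = "Wah, the queue at the polyclinic damn long leh."
prompts = [
    "Translate this Singlish transcript into standard English.",
    "Translate this Singlish transcript into standard English and rephrase it formally.",
]

# One translation per prompt, mirroring the multi-prompt expansion described above
for prompt in prompts:
    print(translate_singlish(transcript, prompt))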

Table 1: Overview of the custom-created translation datasets.

Name    | Samples | Total hours | Avg. duration (s) | Min duration (s) | Max duration (s)
HCtrain | 20,000  | 94.9        | 17.1              | 2.0              | 378.4
HCvalid | 2,000   | 9.2         | 16.6              | 2.0              | 463.0
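
As a sanity check, the duration statistics in Table 1 can be recomputed from the audio segments themselves. A minimal sketch using torchaudio is shown below; the audios/ directory name is an assumption.

import glob
import torchaudio

# Hypothetical directory containing the segmented audio files
durations = []
for path in glob.glob("audios/*.wav"):
    info = torchaudio.info(path)
    durations.append(info.num_frames / info.sample_rate)

print(f"Samples: {len(durations)}")
print(f"Total hours: {sum(durations) / 3600:.1f}")
print(f"Avg duration (s): {sum(durations) / len(durations):.1f}")
print(f"Min (s): {min(durations):.1f}, Max (s): {max(durations):.1f}")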

Evaluation

Evaluation was conducted on the NSCP16_conv dataset, which contains 6,000 Singlish-to-English translation text pairs. Performance was measured using BLEU, comparing the fine-tuned model against the off-the-shelf Whisper-medium baseline.

NSCP16_conv is a bespoke dataset constructed from the Singapore National Speech Corpus (NSC). It is designed to capture the range and richness of Singlish conversational contexts.

  • Conversational and expressive speech includes:
    • Part 3: Natural dialogues on everyday topics between Singaporean speakers.
    • Part 5: Stylized recordings simulating debates, finance-related discussions, and emotional expressions (both positive and negative).
    • Part 6: Scenario-based dialogues, where speakers engage in topic-driven, semi-scripted interactions covering various themes.

Together, these components make NSCP16_conv a robust test set for assessing how well speech models generalize across local speech styles, tones, and speaking conditions.

Table 2: Evaluation results on the test dataset using BLEU. A higher BLEU indicates better performance (↑).

Model                                | BLEU (↑)
Whisper-medium (off-the-shelf)       | 34.45
Whisper-medium-Sing2Eng (fine-tuned) | 40.18

This represents a 5.73 absolute BLEU gain and a 16.6% relative improvement in BLEU over the baseline Whisper-medium model on the NSCP16_conv test set.
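
The relative figure follows from the absolute gain: 5.73 / 34.45 ≈ 16.6%. To reproduce a comparable corpus-level score, one option is sacreBLEU; the card does not state which BLEU implementation was used, so treat this as an assumption, and the sentences below are placeholders.

import sacrebleu

# Placeholder data: model outputs and their gold English references
hypotheses = ["the queue at the polyclinic was very long"]
references = [["the queue at the polyclinic was extremely long"]]  # one reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")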

By learning from diverse local accents and speaking styles, this model significantly improves translation accuracy for Singaporean speech, making it suitable for both research and production applications in multilingual and code-switched environments.

Usage

import torchaudio, torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

task = 'translate'
model_name = 'ivabojic/whisper-medium-sing2eng-translate'
audio_path = 'path_to_audio'  # e.g., https://github.com/IvaBojic/Singlish2English/blob/main/small_dataset/audios/00862042_713.wav

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name, task=task)

# Load and resample audio if needed
audio, sr = torchaudio.load(audio_path)
if sr != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
    audio = resampler(audio)
audio = audio.squeeze().numpy()

# Preprocess and generate translation
inputs = processor(audio=audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

translation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(translation)
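
The same checkpoint should also work through the transformers pipeline API, which bundles feature extraction, generation, and decoding into one call. This is a minimal sketch; chunk_length_s is an assumption for handling long recordings, not a value prescribed by this card.

from transformers import pipeline

# Wraps the model and processor; long audio is split into 30 s windows (assumption)
translator = pipeline(
    "automatic-speech-recognition",
    model="ivabojic/whisper-medium-sing2eng-translate",
    chunk_length_s=30,
)

result = translator("path_to_audio")  # accepts a file path or a 16 kHz NumPy array
print(result["text"])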

Project repository

For training scripts, evaluation tools, sample audio files, and more, visit the GitHub repository: https://github.com/IvaBojic/Singlish2English
