oswald-large-v3-turbo-m1
This model is a fine-tuned version of openai/unsloth/whisper-large-v3-turbo on the creole-text-voice dataset.
The main objective is to create a 99% accurate Haitian Creole Speech-to-Text model, capable of transcribing diverse Haitian voices across accents, regions, and speaking styles.
Model description
oswald-large-v3-turbo-m1 is optimized for Haitian Creole automatic speech recognition (ASR). It builds on OpenAI's Whisper architecture and adapts it to Haitian Creole through transfer learning: the pretrained model is fine-tuned on a curated dataset of roughly 15 hours of Haitian Creole audio-text pairs (7 hours synthetic, 8 hours human).
- Architecture: Whisper Large v3 Turbo
- Fine-tuned for: Haitian Creole (Kreyòl Ayisyen)
- Vocabulary: Based on Latin script (Creole orthography), preserving diacritics and linguistic nuances.
- Voice types: Trained on female and male voices, both synthetic and natural.
- Sampling rate: 16 kHz
- Training objective: Maximize transcription accuracy for everyday Creole speech
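The card does not say how transcription accuracy is measured. A common ASR metric is word error rate (WER), and one possible reading of the 99% target is a WER of 1% or lower; that reading is an assumption, not something the card states. A minimal sketch of WER, using hypothetical example sentences that are not taken from the dataset:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical sentences for illustration only
reference = "mwen renmen peyi mwen"
hypothesis = "mwen renmen peyi a"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, so "accuracy = 1 - WER" is only a rough shorthand.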
Intended uses
Transcribe Haitian Creole speech from:
- Voice notes
- Radio shows
- Interviews
- Public speeches
- Educational content
- Synthetic voices
Enable Creole voice interfaces in:
- Voice assistants
- Transcription services
- Language-learning tools
- Chatbots and accessibility platforms
Limitations
- May struggle with:
  - Extremely poor audio quality (e.g., heavy background noise)
  - Very fast or mumbled speech in some dialects
  - Long audio files
- Not optimized for real-time transcription on low-resource devices
- Fine-tuned on a specific dataset, so it may generalize less well to completely unseen voice types or rare accents
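Whisper models process audio in 30-second windows, so one common workaround for long recordings is to split them into overlapping chunks and transcribe each chunk separately. The sketch below only computes the chunk boundaries; the function name `chunk_bounds` and the 5-second overlap are illustrative assumptions, and merging the overlapping transcripts is left out.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping chunks covering the audio."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


# A 70-second recording sampled at 16 kHz
bounds = chunk_bounds(70 * 16000)
for start, end in bounds:
    print(f"{start / 16000:.0f}s -> {end / 16000:.0f}s")
```

Each `(start, end)` pair can be used to slice the waveform array before passing the slice to the processor.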
Training and evaluation data
The model was trained on the creole-text-voice dataset, which includes:
- 7 hours of synthetic Haitian Creole speech
- 8 hours of human Haitian Creole speech
- Annotated, time-aligned text transcripts following standard Creole orthography
Planned sources for future data collection:
- Public domain radio and podcast archives
- Open-access interviews and spoken-word audio
- Community-submitted voice samples
Preprocessing steps:
- Voice Activity Detection (VAD)
- Noise filtering and audio normalization
- Manual transcript review and correction
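The card does not name the tools used for VAD or normalization. As a rough illustration only, the sketch below uses peak normalization and a simple energy-threshold VAD as stand-ins; the function names, the 30 ms frame length, and the 0.02 threshold are all assumptions, and the signal is synthetic.

```python
import math


def peak_normalize(samples, target_peak=0.95):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-silent input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def energy_vad(samples, sample_rate=16000, frame_ms=30, threshold=0.02):
    """Return one flag per full frame: True where the frame's RMS exceeds threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        flags.append(rms > threshold)
    return flags


# Synthetic signal: 0.3 s of silence followed by 0.3 s of a 220 Hz tone
sr = 16000
n = sr * 3 // 10
silence = [0.0] * n
tone = [0.1 * math.sin(2 * math.pi * 220 * k / sr) for k in range(n)]
signal = peak_normalize(silence + tone)
flags = energy_vad(signal, sample_rate=sr)
print(flags)
```

A fixed energy threshold is fragile on noisy recordings; it is shown here only because it is easy to follow, not because the dataset was built this way.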
Model usage script
# Load the processor and model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import librosa

processor = AutoProcessor.from_pretrained("jsbeaudry/oswald-large-v3-turbo-m1")
model = AutoModelForSpeechSeq2Seq.from_pretrained("jsbeaudry/oswald-large-v3-turbo-m1")

def transcript(audio_file_path):
    # 1. Load the audio and resample it to the 16 kHz rate the model expects
    speech_array, sampling_rate = librosa.load(audio_file_path, sr=16000)
    # 2. Convert the waveform to model input features
    input_features = processor(speech_array, sampling_rate=sampling_rate, return_tensors="pt").input_features
    # 3. Generate predictions
    predicted_ids = model.generate(input_features)
    # 4. Decode the predictions; batch_decode returns a list of strings
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcription

text = transcript("/path_audio")
print(text)
Model usage with Gradio (UI)
from transformers import pipeline
import gradio as gr

# Load the fine-tuned Whisper model as an ASR pipeline
print("Loading model...")
pipe = pipeline("automatic-speech-recognition", model="jsbeaudry/oswald-large-v3-turbo-m1")
print("Model loaded successfully.")

# Transcription function
def transcribe(audio_path):
    if audio_path is None:
        return "Please upload or record an audio file first."
    result = pipe(audio_path)
    return result["text"]

# Build Gradio interface
def create_interface():
    with gr.Blocks(title="Whisper Large v3 Turbo - Haitian Creole") as demo:
        gr.Markdown("# Whisper Large v3 Turbo Creole ASR")
        gr.Markdown(
            "Upload an audio file or record your voice in Haitian Creole. "
            "Then click **Transcribe** to see the result."
        )
        with gr.Row():
            with gr.Column():
                # Gradio 4+ takes sources=[...]; Gradio 3.x used source="upload"
                audio_input = gr.Audio(sources=["upload", "microphone"], type="filepath", label="Upload or Record Audio")
            with gr.Column():
                transcribe_button = gr.Button("Transcribe")
                output_text = gr.Textbox(label="Transcribed Text", lines=4)
        transcribe_button.click(fn=transcribe, inputs=audio_input, outputs=output_text)
    return demo

if __name__ == "__main__":
    interface = create_interface()
    interface.launch()
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-4
- num_epochs: 6.65
- training time (h:mm): 2:52
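| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 0.565400      | 0.656878        |
| 200  | 0.481000      | 0.528320        |
| 300  | 0.457000      | 0.460658        |
| 400  | 0.822300      | 0.419748        |
| 500  | 0.298300      | 0.397042        |
| ...  | ...           | ...             |
| 8300 | 0.049500      | 0.215643        |
| 8400 | 0.024700      | 0.210167        |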
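As a quick way to read the training log above, the following sketch finds the listed step with the lowest validation loss and the relative drop since the first logged step. It uses only the rows shown in the log; the steps elided between 500 and 8300 are not reconstructed, so the result describes the listed rows only.

```python
# (step, training loss, validation loss) rows copied from the training log above
log = [
    (100, 0.565400, 0.656878),
    (200, 0.481000, 0.528320),
    (300, 0.457000, 0.460658),
    (400, 0.822300, 0.419748),
    (500, 0.298300, 0.397042),
    (8300, 0.049500, 0.215643),
    (8400, 0.024700, 0.210167),
]

# Row with the smallest validation loss among the listed rows
best_step, _, best_val = min(log, key=lambda row: row[2])

# Relative reduction in validation loss since the first logged step
first_step, _, first_val = log[0]
reduction = (first_val - best_val) / first_val

print(f"lowest listed validation loss: {best_val} at step {best_step}")
print(f"relative reduction since step {first_step}: {reduction:.1%}")
```

Among the listed rows, the validation loss falls by roughly two thirds between step 100 and step 8400.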
Framework versions
- Transformers 4.46.1
- PyTorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.20.3
Citation
If you use this model, please cite:
@misc{whispermediumcreoleoswald2025,
  title={oswald large turbo M1},
  author={Jean sauvenel beaudry},
  year={2025},
  howpublished={\url{https://huggingface.co/jsbeaudry}}
}