📢 vhdm/whisper-large-fa-v1

🎧 Fine-tuned Whisper Large V3 Turbo for Persian Speech Recognition

This model is a fine-tuned version of openai/whisper-large-v3-turbo trained specifically on high-quality Persian speech data from the vhdm/persian-voice-v1 dataset.


🧪 Evaluation Results

| Metric | Value |
| --- | --- |
| Final Validation Loss | 0.1445 |
| Word Error Rate (WER) | 14.07% |

WER falls from 22.07% at step 1000 to 14.07% at step 5000, with one small regression at step 4000 (15.79% to 16.03%). The final WER of about 14% is measured on clean Persian speech data.
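
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that definition. The function name `word_error_rate`, the whitespace tokenization, and the transliterated toy sentences are assumptions made for this example; this card does not describe the evaluation code or text normalization actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Toy example (made-up, transliterated sentences, not from the dataset):
# one substituted word out of four reference words.
print(word_error_rate("man be khane raftam", "man be khane raft"))  # 0.25
```

A WER of 14.07% therefore means roughly 14 word-level errors (substitutions, deletions, or insertions) per 100 reference words.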


🧠 Model Description

This model aims to bring accurate automatic speech recognition (ASR) to the Persian language using the Whisper architecture. It fine-tunes OpenAI's Whisper Large V3 Turbo backbone on carefully curated Persian data so that it can transcribe Persian audio with high fidelity.


✅ Intended Use

This model is best suited for:

  • 📱 Transcribing Persian voice notes
  • 🗣️ Real-time or batch ASR for Persian podcasts, videos, and interviews
  • 🔍 Creating searchable transcripts of Persian audio content
  • 🧩 Fine-tuning or domain adaptation for Persian speech tasks

🚫 Limitations

  • The model is fine-tuned on clean audio from specific sources and may perform poorly on noisy, accented, or dialectal speech.
  • The model is not optimized for real-time streaming ASR, although offline inference is fast.
  • It may occasionally produce hallucinations (incorrect but plausible words), a common issue in Whisper models.

📚 Training Data

The model was trained on the vhdm/persian-voice-v1 dataset, a curated collection of Persian speech recordings with high-quality transcriptions.


⚙️ Training Procedure

  • Optimizer: AdamW (betas=(0.9, 0.999), eps=1e-08)
  • Learning Rate: 1e-5
  • Batch Sizes: Train - 16 | Eval - 8
  • Scheduler: Linear with 500 warmup steps
  • Mixed Precision: Native AMP (automatic mixed precision)
  • Seed: 42
  • Training Steps: 5000
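
The linear scheduler with warmup listed above can be sketched in plain Python. The function name `linear_warmup_lr` is hypothetical, and the assumption that the rate decays linearly to zero at step 5000 follows the usual behavior of a linear schedule with warmup; this card does not state the end value.

```python
def linear_warmup_lr(step: int,
                     peak_lr: float = 1e-5,
                     warmup_steps: int = 500,
                     total_steps: int = 5000) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup steps.
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr to 0 (assumed end value) at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Print the rate at a few points of the 5000-step run.
for step in (0, 250, 500, 2750, 5000):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the rate peaks at 1e-5 at step 500 and is back to half of that value by step 2750.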

⏱️ Training Time & Hardware

The model was trained using an NVIDIA H100 GPU, and the full fine-tuning process took approximately 20 hours.


📈 Training Progress

| Step | Training Loss | Validation Loss | WER (%) |
| --- | --- | --- | --- |
| 1000 | 0.2190 | 0.2093 | 22.07 |
| 2000 | 0.1191 | 0.1698 | 17.85 |
| 3000 | 0.1051 | 0.1485 | 15.79 |
| 4000 | 0.0644 | 0.1530 | 16.03 |
| 5000 | 0.0289 | 0.1445 | 14.07 |
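
As a quick check on the table, the best checkpoint, the relative WER reduction, and any regressions can be computed directly from the reported numbers. The snippet uses only the values in the table above; the variable names are illustrative.

```python
# (step, training loss, validation loss, WER %) copied from the table above.
checkpoints = [
    (1000, 0.2190, 0.2093, 22.07),
    (2000, 0.1191, 0.1698, 17.85),
    (3000, 0.1051, 0.1485, 15.79),
    (4000, 0.0644, 0.1530, 16.03),
    (5000, 0.0289, 0.1445, 14.07),
]

# Checkpoint with the lowest WER.
best_step, _, _, best_wer = min(checkpoints, key=lambda row: row[3])

# Relative WER reduction from the first to the last checkpoint.
first_wer = checkpoints[0][3]
last_wer = checkpoints[-1][3]
relative_reduction = (first_wer - last_wer) / first_wer

# Steps where WER got worse compared with the previous checkpoint.
regressions = [
    curr[0]
    for prev, curr in zip(checkpoints, checkpoints[1:])
    if curr[3] > prev[3]
]

print(f"best step: {best_step} (WER {best_wer}%)")
print(f"relative WER reduction: {relative_reduction:.1%}")
print(f"regressions at steps: {regressions}")
```

The final checkpoint has the lowest WER, about 36% lower in relative terms than at step 1000, and step 4000 is the only checkpoint where WER rises.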

🧰 Framework Versions

  • transformers: 4.52.4
  • torch: 2.7.1+cu118
  • datasets: 3.6.0
  • tokenizers: 0.21.1

🚀 Try it out

You can load and test the model using 🤗 Transformers:

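```python
from transformers import pipeline

# Load the fine-tuned checkpoint into an ASR pipeline.
pipe = pipeline("automatic-speech-recognition", model="vhdm/whisper-large-fa-v1")

# Transcribe a local audio file and print the recognized text.
result = pipe("path_to_persian_audio.wav")
print(result["text"])
```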
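
For recordings longer than Whisper's 30-second input window, the pipeline can split the audio into chunks and return segment timestamps. The following is a sketch under assumptions, not a documented recipe for this model: `transcribe_long` and `format_segments` are hypothetical helper names, the file path is a placeholder, and forcing `language="persian"` through `generate_kwargs` is a choice made for this example.

```python
def format_segments(chunks):
    """Render pipeline chunks as '[start-end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # The end time can be None when the final segment is cut off.
        end_label = f"{end:.1f}s" if end is not None else "end"
        lines.append(f"[{start:.1f}s-{end_label}] {chunk['text'].strip()}")
    return lines


def transcribe_long(path):
    # Imported here so format_segments stays usable without transformers.
    from transformers import pipeline

    pipe = pipeline(
        "automatic-speech-recognition",
        model="vhdm/whisper-large-fa-v1",
        chunk_length_s=30,  # split long audio into 30-second chunks
    )
    result = pipe(
        path,
        return_timestamps=True,
        generate_kwargs={"language": "persian", "task": "transcribe"},
    )
    return format_segments(result["chunks"])
```

Calling `transcribe_long("path_to_persian_audio.wav")` would then return one line per segment, which is a convenient starting point for searchable transcripts.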

Model size: 809M params (Safetensors, tensor type F32)