---
library_name: transformers
language:
- fa
license: mit
base_model: openai/whisper-large-v3-turbo
tags:
- whisper
- whisper-large-v3
- persian
- farsi
- speech-recognition
- asr
- automatic-speech-recognition
- audio
- transformers
- generated_from_trainer
- h100
- huggingface
- vhdm
datasets:
- vhdm/persian-voice-v1.1
metrics:
- wer
model-index:
- name: vhdm/whisper-large-fa-v1
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: vhdm/persian-voice-v1
      type: vhdm/persian-voice-v1.1
      args: 'config: fa, split: test'
    metrics:
    - name: Wer
      type: wer
      value: 14.065335753176045
---

# 📢 vhdm/whisper-large-fa-v1

🎧 **Fine-tuned Whisper Large V3 Turbo for Persian Speech Recognition**

This model is a fine-tuned version of [`openai/whisper-large-v3-turbo`](https://huggingface.co/openai/whisper-large-v3-turbo), trained specifically on high-quality Persian speech data from the [`vhdm/persian-voice-v1`](https://huggingface.co/datasets/vhdm/persian-voice-v1) dataset.

---

## 🧪 Evaluation Results

| Metric | Value |
|--------|-------|
| **Final Validation Loss** | 0.1445 |
| **Word Error Rate (WER)** | **14.07%** |

WER improves over training, from 22.07% at step 1000 to 14.07% at step 5000 (with a slight uptick at step 4000), measured on clean Persian speech data.

---

## 🧠 Model Description

This model aims to bring high-accuracy **automatic speech recognition (ASR)** to the Persian language using the Whisper architecture. It builds on OpenAI's Whisper Large V3 Turbo backbone and is fine-tuned on carefully curated Persian data, reaching a WER of about 14% on the evaluation set.

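For readers unfamiliar with the WER metric reported under Evaluation Results, the following is a minimal pure-Python sketch of its conventional definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. This card does not state which implementation or text normalization produced the reported 14.07%, so the function name `word_error_rate` and the plain whitespace tokenization are illustrative assumptions, not the evaluation code used for this model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words
print(word_error_rate("من به خانه رفتم", "من به مدرسه رفتم"))
```

In practice, Persian text is usually normalized before scoring (for example, unifying Arabic and Persian letter variants), and that choice can shift the resulting number noticeably.
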

---

## ✅ Intended Use

This model is best suited for:

- 📱 Transcribing Persian voice notes
- 🗣️ Batch or near-real-time ASR for Persian podcasts, videos, and interviews
- 🔍 Creating searchable transcripts of Persian audio content
- 🧩 Fine-tuning or domain adaptation for Persian speech tasks

### 🚫 Limitations

- The model is fine-tuned on clean audio from specific sources and may perform poorly on noisy, accented, or dialectal speech.
- It is not optimized for real-time streaming ASR (though inference is fast).
- It may occasionally produce hallucinations (incorrect but plausible words), a common issue in Whisper models.

---

## 📚 Training Data

The model was trained on the [`vhdm/persian-voice-v1`](https://huggingface.co/datasets/vhdm/persian-voice-v1) dataset, a curated collection of Persian speech recordings with high-quality transcriptions.

---

## ⚙️ Training Procedure

- **Optimizer**: AdamW (`betas=(0.9, 0.999)`, `eps=1e-08`)
- **Learning Rate**: 1e-5
- **Batch Sizes**: Train 16, Eval 8
- **Scheduler**: Linear with 500 warmup steps
- **Mixed Precision**: Native AMP (automatic mixed precision)
- **Seed**: 42
- **Training Steps**: 5000

---

## ⏱️ Training Time & Hardware

The model was trained on an **NVIDIA H100 GPU**, and the full fine-tuning process took approximately **20 hours**.

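The scheduler settings listed under Training Procedure (peak learning rate 1e-5, 500 warmup steps, 5000 training steps) can be illustrated with a small pure-Python sketch. It assumes the usual linear schedule, in which the rate ramps linearly from 0 to the peak during warmup and then decays linearly to 0 at the final step. The function name `linear_lr` is hypothetical, and this is not the training code used for this model.

```python
PEAK_LR = 1e-5        # "Learning Rate" from the Training Procedure list
WARMUP_STEPS = 500    # "Linear with 500 warmup steps"
TOTAL_STEPS = 5000    # "Training Steps"


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        # Warmup: ramp linearly from 0 up to the peak rate
        return PEAK_LR * step / WARMUP_STEPS
    # Decay: fall linearly from the peak to 0 at the final step
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 250, 500, 2750, 5000):
    print(f"step {step:>4}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the rate is at half its peak both midway through warmup (step 250) and midway through the decay phase (step 2750).
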

---

## 📈 Training Progress

| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 1000 | 0.2190 | 0.2093 | 22.07 |
| 2000 | 0.1191 | 0.1698 | 17.85 |
| 3000 | 0.1051 | 0.1485 | 15.79 |
| 4000 | 0.0644 | 0.1530 | 16.03 |
| 5000 | 0.0289 | 0.1445 | **14.07** |

---

## 🧰 Framework Versions

- `transformers`: 4.52.4
- `torch`: 2.7.1+cu118
- `datasets`: 3.6.0
- `tokenizers`: 0.21.1

---

## 🚀 Try it out

You can load and test the model using 🤗 Transformers:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as an ASR pipeline
pipe = pipeline("automatic-speech-recognition", model="vhdm/whisper-large-fa-v1")

# Transcribe a local audio file (replace with your own path)
result = pipe("path_to_persian_audio.wav")
print(result["text"])
```