# Whisper Persian Fine-tuned Model
A fine-tuned Whisper model optimized for Persian (Farsi) speech-to-text conversion using the LoRA (Low-Rank Adaptation) technique. This model provides real-time speech recognition for the Persian language.
## Model Details

### Model Description
This model is a fine-tuned version of OpenAI's Whisper-base model, specifically adapted for Persian language speech recognition. The model uses LoRA (Low-Rank Adaptation) for efficient fine-tuning while maintaining the original model's capabilities.
- Developed by: Yasin Keykh
- Model type: Speech-to-Text (Automatic Speech Recognition)
- Language(s): Persian (Farsi)
- License: Apache 2.0
- Finetuned from model: openai/whisper-base
- Fine-tuning method: LoRA (Low-Rank Adaptation)
### Model Sources
- Base Model: openai/whisper-base
- Fine-tuning Framework: PEFT (Parameter-Efficient Fine-Tuning)
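
The examples below load the repository as a standalone checkpoint. If the repository instead ships the raw LoRA adapter weights, they can usually be attached to the base model with PEFT; the following is a minimal sketch, assuming an adapter config is present in the repo:

```python
from transformers import WhisperForConditionalGeneration
from peft import PeftModel

# Load the base model, then attach the LoRA adapter on top of it
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
model = PeftModel.from_pretrained(base, "Paulwalker4884/whisper-persian")

# Optionally merge the adapter into the base weights for faster inference
model = model.merge_and_unload()
```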
## Uses

### Direct Use

This model is designed for Persian speech-to-text conversion with real-time capabilities. It can be used to:
- Perform real-time Persian speech recognition from a microphone
- Transcribe Persian audio files with high accuracy
- Convert Persian speech to text in live applications
- Build Persian voice assistants or dictation systems
- Create subtitles for Persian audio/video content
### Downstream Use
The model can be integrated into larger applications such as:
- Voice-controlled Persian applications
- Persian podcast transcription services
- Educational tools for Persian language learning
- Accessibility tools for Persian-speaking users
### Out-of-Scope Use
- The model is optimized for Persian and may not perform well on other languages
- Not suitable for noisy environments without proper audio preprocessing
- May have reduced accuracy on dialects significantly different from the training data
### Use in Transformers

```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("Paulwalker4884/whisper-persian")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Paulwalker4884/whisper-persian")
```
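
For quick experiments, the high-level `pipeline` API can wrap the same checkpoint; this is a sketch, assuming `ffmpeg` is available so the pipeline can decode audio files directly (the file name is a placeholder):

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="Paulwalker4884/whisper-persian")
print(asr("your_persian_audio.wav")["text"])
```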
## How to Get Started with the Model

### Installation

First, install the required dependencies:

```bash
pip install transformers torch torchaudio numpy sounddevice
```

### Usage

#### Real-time Audio Recording and Transcription
```python
import numpy as np
import sounddevice as sd
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load the fine-tuned Persian model
processor = AutoProcessor.from_pretrained("Paulwalker4884/whisper-persian")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Paulwalker4884/whisper-persian").to("cpu")

# Record audio
duration = 5  # seconds
sample_rate = 16000

print("شروع ضبط...")  # "Recording started..."
audio = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1)
sd.wait()
print("پایان ضبط.")  # "Recording finished."

# Convert to a 1D array
audio = np.squeeze(audio)

# Process audio
input_features = processor(audio, sampling_rate=sample_rate, return_tensors="pt").input_features

# Generate transcription
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print("متن شناسایی شده:")  # "Recognized text:"
print(transcription)
```
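
By default, Whisper detects the spoken language automatically. On recent releases of transformers, the decoder can be forced to Persian transcription explicitly; this hedged variant assumes a version whose Whisper `generate()` accepts the `language`/`task` keywords:

```python
# Force Persian transcription instead of relying on automatic language detection
predicted_ids = model.generate(input_features, language="fa", task="transcribe")
```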
#### Audio File Transcription

```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load the model and processor
processor = AutoProcessor.from_pretrained("Paulwalker4884/whisper-persian")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Paulwalker4884/whisper-persian")

# Load and preprocess audio
audio_path = "your_persian_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
    waveform = resampler(waveform)

# Downmix to mono (Whisper expects a single channel)
waveform = waveform.mean(dim=0)

# Process audio
input_features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode the transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")
```
#### Batch Processing

```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load the model and processor
processor = AutoProcessor.from_pretrained("Paulwalker4884/whisper-persian")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Paulwalker4884/whisper-persian")

# For processing multiple audio files
def transcribe_persian_audio(audio_paths):
    transcriptions = []
    for audio_path in audio_paths:
        waveform, sample_rate = torchaudio.load(audio_path)
        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
            waveform = resampler(waveform)
        waveform = waveform.mean(dim=0)  # downmix to mono
        input_features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features
        with torch.no_grad():
            predicted_ids = model.generate(input_features)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        transcriptions.append(transcription)
    return transcriptions

# Usage
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = transcribe_persian_audio(audio_files)
```
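
The helper above transcribes files one at a time. If every clip fits within Whisper's 30-second window, the clips can also be batched through the model in a single forward pass; this is a sketch under that assumption, reusing the `processor` and `model` loaded above:

```python
import torch
import torchaudio

def load_mono_16k(path):
    # Load, resample to 16 kHz, and downmix to mono
    waveform, sr = torchaudio.load(path)
    if sr != 16000:
        waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
    return waveform.mean(dim=0).numpy()

clips = [load_mono_16k(p) for p in ["audio1.wav", "audio2.wav", "audio3.wav"]]

# The feature extractor pads every clip to Whisper's fixed 30-second input
inputs = processor(clips, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```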
## Training Details

### Training Data
The model was fine-tuned on Persian speech data to improve performance on Farsi language recognition tasks. The training focused on:
- Common Persian vocabulary and phrases
- Various Persian accents and speaking styles
- Different audio qualities and recording conditions
### Training Procedure

#### Fine-tuning Method
- Technique: LoRA (Low-Rank Adaptation)
- Framework: PEFT 0.17.0
- Base Model: openai/whisper-base
#### Training Hyperparameters
- Fine-tuning approach: Parameter-efficient fine-tuning with LoRA
- Target modules: Attention layers and feed-forward networks
- LoRA rank: 8-16 (typical range for speech models)
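
The exact adapter configuration is not published. As an illustration only, a typical LoRA setup for Whisper with PEFT looks roughly like the following; the rank, alpha, and target modules here are assumptions, not the training configuration of this model:

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

# Illustrative values; the actual rank and target modules used for this model are not published
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```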
## Evaluation
The model has been evaluated on Persian speech recognition benchmarks and shows improved performance over the base Whisper model for Persian language tasks.
### Metrics
- Word Error Rate (WER): Improved compared to base model on Persian test sets
- Character Error Rate (CER): Enhanced character-level accuracy for Persian text
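
No benchmark numbers are published in this card. To measure WER and CER on your own Persian test set, a sketch like the following could be used (it assumes the Hugging Face `evaluate` library plus `jiwer`, and reuses `transcribe_persian_audio` from the Batch Processing example above; file names and reference texts are placeholders):

```python
# pip install evaluate jiwer
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

audio_files = ["audio1.wav", "audio2.wav"]                          # placeholder test clips
references = ["reference transcript 1", "reference transcript 2"]  # ground-truth Persian text
predictions = transcribe_persian_audio(audio_files)

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```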
## Bias, Risks, and Limitations

### Limitations
- Performance may vary depending on audio quality and recording conditions
- Accuracy might be reduced for strong dialectal variations
- May have lower performance on technical or domain-specific Persian terminology not present in training data
### Recommendations
- Ensure good audio quality for optimal performance
- Consider audio preprocessing for noisy environments
- Test the model on your specific use case to evaluate performance
- Be aware of potential biases in training data that may affect certain speakers or contexts
## Technical Specifications

### Model Architecture
- Base Architecture: Whisper Transformer
- Fine-tuning Method: LoRA adapters
- Input: 16kHz mono audio
- Output: Persian text transcription
### Framework Versions
- PEFT: 0.17.0
- Transformers: Compatible with latest versions
- PyTorch: 1.9.0+
## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{whisper-persian-paulwalker4884,
  author    = {Yasin Keykh},
  title     = {Whisper Persian Fine-tuned Model},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Paulwalker4884/whisper-persian}
}
```
## Model Card Contact
Author: Yasin Keykh
For questions or issues regarding this model, please open an issue in the model repository or contact the author directly.
This model is based on OpenAI's Whisper and has been fine-tuned for Persian language speech recognition using modern parameter-efficient fine-tuning techniques.