Whisper Persian Fine-tuned Model

A fine-tuned Whisper model for Persian (Farsi) speech-to-text, adapted with the LoRA (Low-Rank Adaptation) technique. It targets accurate Persian speech recognition, including transcription of short microphone recordings in near real time.

Model Details

Model Description

This model is a fine-tuned version of OpenAI's Whisper-base model, adapted specifically for Persian speech recognition. Fine-tuning uses LoRA (Low-Rank Adaptation), which trains a small set of added low-rank weights while the base model's own weights stay frozen, so the original model's capabilities are largely retained.

  • Developed by: Yasin Keykh
  • Model type: Speech-to-Text (Automatic Speech Recognition)
  • Language(s): Persian (Farsi)
  • License: Apache 2.0
  • Finetuned from model: openai/whisper-base
  • Fine-tuning method: LoRA (Low-Rank Adaptation)

Model Sources

  • Base Model: openai/whisper-base
  • Fine-tuning Framework: PEFT (Parameter-Efficient Fine-Tuning)

Uses

Direct Use

This model is designed for Persian speech-to-text conversion, including near-real-time use. It can be used to:

  • Recognize Persian speech from a microphone in near real time
  • Transcribe Persian audio files
  • Convert Persian speech to text in live applications
  • Build Persian voice assistants or dictation systems
  • Create subtitles for Persian audio/video content
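
For the subtitle use case, transcribed text has to be paired with timing information and written in a subtitle format. The sketch below formats a list of segments as SRT. It assumes you already have `(start_seconds, end_seconds, text)` tuples, for example from transcribing fixed-length chunks and using the chunk boundaries as timings; the function names are illustrative and not part of any library.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples.

    The segment timings are assumed to come from your own chunking logic.
    """
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # SRT separates subtitle blocks with a blank line
    return "\n".join(blocks)
```

Writing the result to a file with `encoding="utf-8"` keeps the Persian text intact in most players.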

Downstream Use

The model can be integrated into larger applications such as:

  • Voice-controlled Persian applications
  • Persian podcast transcription services
  • Educational tools for Persian language learning
  • Accessibility tools for Persian-speaking users

Out-of-Scope Use

  • The model is optimized for Persian and may not perform well on other languages
  • Not suitable for noisy environments without proper audio preprocessing
  • May have reduced accuracy on dialects significantly different from the training data

Use in Transformers

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("Paulwalker4884/whisper-persian")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Paulwalker4884/whisper-persian")

How to Get Started with the Model

Installation

First, install the required dependencies:

pip install transformers torch torchaudio numpy sounddevice

Usage

Real-time Audio Recording and Transcription

import numpy as np
import sounddevice as sd
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch

# Load the fine-tuned Persian model
processor = AutoProcessor.from_pretrained("Paulwalker4884/whisper-persian")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Paulwalker4884/whisper-persian").to("cpu")

# Record audio
duration = 5  # seconds
sample_rate = 16000

print("Recording started...")
audio = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1)
sd.wait()  # block until the recording is complete
print("Recording finished.")

# Convert to 1D array
audio = np.squeeze(audio)

# Process audio
input_features = processor(audio, sampling_rate=sample_rate, return_tensors="pt").input_features

# Generate transcription
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print("Recognized text:")
print(transcription)

Audio File Transcription

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import torchaudio

# Load the model and processor
processor = AutoProcessor.from_pretrained("Paulwalker4884/whisper-persian")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Paulwalker4884/whisper-persian")

# Load and preprocess audio
audio_path = "your_persian_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
    waveform = resampler(waveform)

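# Downmix to mono: torchaudio returns (channels, samples),
# and squeeze() alone would leave a stereo file 2-D
waveform = waveform.mean(dim=0)

# Process audio
input_features = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt").input_features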

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features)
    
# Decode the transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

Batch Processing

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import torchaudio

# Load the model and processor
processor = AutoProcessor.from_pretrained("Paulwalker4884/whisper-persian")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Paulwalker4884/whisper-persian")

# For processing multiple audio files
def transcribe_persian_audio(audio_paths):
    transcriptions = []
    
    for audio_path in audio_paths:
        waveform, sample_rate = torchaudio.load(audio_path)
        
        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
            waveform = resampler(waveform)
        
        # Downmix to mono; torchaudio returns (channels, samples)
        waveform = waveform.mean(dim=0)
        input_features = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt").input_features
        
        with torch.no_grad():
            predicted_ids = model.generate(input_features)
        
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        transcriptions.append(transcription)
    
    return transcriptions

# Usage
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = transcribe_persian_audio(audio_files)
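
Whisper works on 30-second windows, and by default the processor pads or truncates each input to that length, so the examples above only see the first 30 seconds of a longer recording. One simple workaround is to split the audio into consecutive chunks and transcribe each one. The helper below is a minimal sketch of that idea; `chunk_bounds` is a hypothetical name, and cutting at fixed boundaries can split a word in half, so overlapping chunks or silence-based splitting may work better in practice.

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16000, chunk_seconds: float = 30.0):
    """Return (start, end) sample indices that cover the recording in order."""
    chunk_len = int(chunk_seconds * sample_rate)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds * sample_rate must be positive")
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]
```

Each pair can then be used to slice a 1-D audio array, for example `audio[start:end]`, before passing the slice to the processor as in the examples above.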

Training Details

Training Data

The model was fine-tuned on Persian speech data to improve performance on Farsi language recognition tasks. The training focused on:

  • Common Persian vocabulary and phrases
  • Various Persian accents and speaking styles
  • Different audio qualities and recording conditions

Training Procedure

Fine-tuning Method

  • Technique: LoRA (Low-Rank Adaptation)
  • Framework: PEFT 0.17.0
  • Base Model: openai/whisper-base

Training Hyperparameters

  • Fine-tuning approach: Parameter-efficient fine-tuning with LoRA
  • Target modules: Attention layers and feed-forward networks
  • LoRA rank: 8-16 (typical range for speech models)
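
To make the parameter savings concrete, the sketch below counts the weights LoRA adds to a single linear layer. It assumes square 512 x 512 attention projections (512 is the hidden size of whisper-base) and tries both ends of the rank range listed above. The name `lora_param_count` is illustrative and not part of PEFT, and the exact rank and target layers used for this model are not stated in this card.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer.

    LoRA learns two small factors, A (rank x d_in) and B (d_out x rank),
    while the original weight matrix stays frozen.
    """
    return rank * (d_in + d_out)


d_model = 512              # hidden size of whisper-base
full = d_model * d_model   # weights in one square attention projection

for rank in (8, 16):       # assumed values, taken from the range above
    added = lora_param_count(d_model, d_model, rank)
    print(f"rank={rank}: {added} trainable vs {full} frozen ({added / full:.2%})")
```

The added weights grow linearly with the rank, which is why small ranks keep the adapter a small fraction of the size of the layers it modifies.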

Evaluation

The model has been evaluated on Persian speech recognition benchmarks and shows improved performance over the base Whisper model for Persian language tasks.

Metrics

  • Word Error Rate (WER): Lower than the base model on Persian test sets
  • Character Error Rate (CER): Fewer character-level errors on Persian text
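
Both metrics are edit distances divided by the length of the reference: WER counts word-level substitutions, insertions and deletions, and CER does the same at character level. The following is a minimal, dependency-free sketch for checking the model on your own data. It is not the evaluation script used for this model (none is published in this card), and the function names are illustrative. Real evaluations usually normalize the text first, for example by unifying Persian and Arabic character variants, which this sketch leaves out.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces are counted as characters here."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Because insertions are counted, both rates can exceed 1.0 when the hypothesis is much longer than the reference.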

Bias, Risks, and Limitations

Limitations

  • Performance may vary depending on audio quality and recording conditions
  • Accuracy might be reduced for strong dialectal variations
  • May have lower performance on technical or domain-specific Persian terminology not present in training data

Recommendations

  • Ensure good audio quality for optimal performance
  • Consider audio preprocessing for noisy environments
  • Test the model on your specific use case to evaluate performance
  • Be aware of potential biases in training data that may affect certain speakers or contexts
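
As a starting point for the preprocessing recommendation, the sketch below conditions a recording before it is passed to the processor: it downmixes to mono, removes any constant (DC) offset, and scales the signal to a consistent peak level. This is level conditioning only, not noise reduction, and `prepare_audio` and its `peak` parameter are illustrative names rather than part of any library. It assumes audio shaped `(samples,)` or `(samples, channels)`, the layout `sounddevice` returns.

```python
import numpy as np


def prepare_audio(audio, peak: float = 0.95) -> np.ndarray:
    """Downmix to mono, remove DC offset, and peak-normalize.

    Expects shape (samples,) or (samples, channels).
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)   # average the channels into one
    if audio.size == 0:
        return audio                 # nothing to normalize
    audio = audio - audio.mean()     # remove constant (DC) offset
    max_abs = np.abs(audio).max()
    if max_abs > 0:                  # leave pure silence untouched
        audio = audio * (peak / max_abs)
    return audio.astype(np.float32)
```

For noisy recordings, a dedicated denoising step would still be needed before or after this function.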

Technical Specifications

Model Architecture

  • Base Architecture: Whisper Transformer
  • Fine-tuning Method: LoRA adapters
  • Input: 16 kHz mono audio
  • Output: Persian text transcription

Framework Versions

  • PEFT: 0.17.0
  • Transformers: Compatible with latest versions
  • PyTorch: 1.9.0+

Citation

If you use this model in your research or applications, please cite:

@misc{whisper-persian-paulwalker4884,
  author = {Yasin Keykh},
  title = {Whisper Persian Fine-tuned Model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Paulwalker4884/whisper-persian}
}

Model Card Contact

Author: Yasin Keykh

For questions or issues regarding this model, please open an issue in the model repository or contact the author directly.


This model is based on OpenAI's Whisper and has been fine-tuned for Persian language speech recognition using modern parameter-efficient fine-tuning techniques.
