# Whisper Persian Fine-tuned Model
A fine-tuned Whisper model optimized for Persian (Farsi) speech-to-text conversion using the LoRA (Low-Rank Adaptation) technique. This model provides real-time speech recognition for the Persian language.
## Model Details

### Model Description
This model is a fine-tuned version of OpenAI's Whisper-base model, specifically adapted for Persian language speech recognition. The model uses LoRA (Low-Rank Adaptation) for efficient fine-tuning while maintaining the original model's capabilities.
- Developed by: Yasin Keykh
- Model type: Speech-to-Text (Automatic Speech Recognition)
- Language(s): Persian (Farsi)
- License: Apache 2.0
- Finetuned from model: openai/whisper-base
- Fine-tuning method: LoRA (Low-Rank Adaptation)
### Model Sources
- Base Model: openai/whisper-base
- Fine-tuning Framework: PEFT (Parameter-Efficient Fine-Tuning)
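
The examples below load the repository as a standalone checkpoint. If the repository instead ships the raw LoRA adapter weights, they can usually be attached to the base model with PEFT; the following is a minimal sketch, assuming an adapter config is present in the repo:

```python
from transformers import WhisperForConditionalGeneration
from peft import PeftModel

# Load the base model, then attach the LoRA adapter on top of it
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
model = PeftModel.from_pretrained(base, "Paulwalker4884/whisper-persian")

# Optionally merge the adapter into the base weights for faster inference
model = model.merge_and_unload()
```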
## Uses

### Direct Use

This model is designed for Persian speech-to-text conversion with real-time capabilities. It can be used to:
- Perform real-time Persian speech recognition from a microphone
- Transcribe Persian audio files with high accuracy
- Convert Persian speech to text in live applications
- Build Persian voice assistants or dictation systems
- Create subtitles for Persian audio/video content
### Downstream Use
The model can be integrated into larger applications such as:
- Voice-controlled Persian applications
- Persian podcast transcription services
- Educational tools for Persian language learning
- Accessibility tools for Persian-speaking users
### Out-of-Scope Use
- The model is optimized for Persian and may not perform well on other languages
- Not suitable for noisy environments without proper audio preprocessing
- May have reduced accuracy on dialects significantly different from the training data
### Use in Transformers

```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("Paulwalker4884/whisper-persian")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Paulwalker4884/whisper-persian")
```
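
For quick experiments, the high-level `pipeline` API can wrap the same checkpoint; this is a sketch, assuming `ffmpeg` is available so the pipeline can decode audio files directly (the file name is a placeholder):

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="Paulwalker4884/whisper-persian")
print(asr("your_persian_audio.wav")["text"])
```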
## How to Get Started with the Model

### Installation

First, install the required dependencies:

```bash
pip install transformers torch torchaudio numpy sounddevice
```

### Usage

#### Real-time Audio Recording and Transcription
```python
import numpy as np
import sounddevice as sd
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load the fine-tuned Persian model
processor = AutoProcessor.from_pretrained("Paulwalker4884/whisper-persian")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Paulwalker4884/whisper-persian").to("cpu")

# Record audio
duration = 5  # seconds
sample_rate = 16000

print("شروع ضبط...")  # "Recording started..."
audio = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1)
sd.wait()
print("پایان ضبط.")  # "Recording finished."

# Convert to a 1D array
audio = np.squeeze(audio)

# Process audio
input_features = processor(audio, sampling_rate=sample_rate, return_tensors="pt").input_features

# Generate transcription
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print("متن شناسایی شده:")  # "Recognized text:"
print(transcription)
```
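
By default, Whisper detects the spoken language automatically. On recent releases of transformers, the decoder can be forced to Persian transcription explicitly; this hedged variant assumes a version whose Whisper `generate()` accepts the `language`/`task` keywords:

```python
# Force Persian transcription instead of relying on automatic language detection
predicted_ids = model.generate(input_features, language="fa", task="transcribe")
```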
#### Audio File Transcription

```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load the model and processor
processor = AutoProcessor.from_pretrained("Paulwalker4884/whisper-persian")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Paulwalker4884/whisper-persian")

# Load and preprocess audio
audio_path = "your_persian_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
    waveform = resampler(waveform)

# Downmix to mono (Whisper expects a single channel)
waveform = waveform.mean(dim=0)

# Process audio
input_features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode the transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")
```
#### Batch Processing

```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load the model and processor
processor = AutoProcessor.from_pretrained("Paulwalker4884/whisper-persian")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Paulwalker4884/whisper-persian")

# For processing multiple audio files
def transcribe_persian_audio(audio_paths):
    transcriptions = []
    for audio_path in audio_paths:
        waveform, sample_rate = torchaudio.load(audio_path)
        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
            waveform = resampler(waveform)
        waveform = waveform.mean(dim=0)  # downmix to mono
        input_features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features
        with torch.no_grad():
            predicted_ids = model.generate(input_features)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        transcriptions.append(transcription)
    return transcriptions

# Usage
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = transcribe_persian_audio(audio_files)
```
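
The helper above transcribes files one at a time. If every clip fits within Whisper's 30-second window, the clips can also be batched through the model in a single forward pass; this is a sketch under that assumption, reusing the `processor` and `model` loaded above:

```python
import torch
import torchaudio

def load_mono_16k(path):
    # Load, resample to 16 kHz, and downmix to mono
    waveform, sr = torchaudio.load(path)
    if sr != 16000:
        waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
    return waveform.mean(dim=0).numpy()

clips = [load_mono_16k(p) for p in ["audio1.wav", "audio2.wav", "audio3.wav"]]

# The feature extractor pads every clip to Whisper's fixed 30-second input
inputs = processor(clips, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```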
## Training Details

### Training Data
The model was fine-tuned on Persian speech data to improve performance on Farsi language recognition tasks. The training focused on:
- Common Persian vocabulary and phrases
- Various Persian accents and speaking styles
- Different audio qualities and recording conditions
### Training Procedure

#### Fine-tuning Method
- Technique: LoRA (Low-Rank Adaptation)
- Framework: PEFT 0.17.0
- Base Model: openai/whisper-base
#### Training Hyperparameters
- Fine-tuning approach: Parameter-efficient fine-tuning with LoRA
- Target modules: Attention layers and feed-forward networks
- LoRA rank: 8-16 (typical range for speech models)
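
The exact adapter configuration is not published. As an illustration only, a typical LoRA setup for Whisper with PEFT looks roughly like the following; the rank, alpha, and target modules here are assumptions, not the training configuration of this model:

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

# Illustrative values; the actual rank and target modules used for this model are not published
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```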
## Evaluation
The model has been evaluated on Persian speech recognition benchmarks and shows improved performance over the base Whisper model for Persian language tasks.
### Metrics
- Word Error Rate (WER): Improved compared to base model on Persian test sets
- Character Error Rate (CER): Enhanced character-level accuracy for Persian text
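
No benchmark numbers are published in this card. To measure WER and CER on your own Persian test set, a sketch like the following could be used (it assumes the Hugging Face `evaluate` library plus `jiwer`, and reuses `transcribe_persian_audio` from the Batch Processing example above; file names and reference texts are placeholders):

```python
# pip install evaluate jiwer
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

audio_files = ["audio1.wav", "audio2.wav"]                          # placeholder test clips
references = ["reference transcript 1", "reference transcript 2"]  # ground-truth Persian text
predictions = transcribe_persian_audio(audio_files)

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```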
## Bias, Risks, and Limitations

### Limitations
- Performance may vary depending on audio quality and recording conditions
- Accuracy might be reduced for strong dialectal variations
- May have lower performance on technical or domain-specific Persian terminology not present in training data
### Recommendations
- Ensure good audio quality for optimal performance
- Consider audio preprocessing for noisy environments
- Test the model on your specific use case to evaluate performance
- Be aware of potential biases in training data that may affect certain speakers or contexts
## Technical Specifications

### Model Architecture
- Base Architecture: Whisper Transformer
- Fine-tuning Method: LoRA adapters
- Input: 16kHz mono audio
- Output: Persian text transcription
### Framework Versions
- PEFT: 0.17.0
- Transformers: Compatible with latest versions
- PyTorch: 1.9.0+
## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{whisper-persian-paulwalker4884,
  author    = {Yasin Keykh},
  title     = {Whisper Persian Fine-tuned Model},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Paulwalker4884/whisper-persian}
}
```
## Model Card Contact
Author: Yasin Keykh
For questions or issues regarding this model, please open an issue in the model repository or contact the author directly.
This model is based on OpenAI's Whisper and has been fine-tuned for Persian language speech recognition using modern parameter-efficient fine-tuning techniques.