Cahya Whisper Medium ONNX

ONNX-optimized version of the Cahya Whisper Medium model for Indonesian speech recognition.

Model Description

This repository contains the quantized ONNX version of the cahya/whisper-medium-id model, optimized for faster inference while maintaining transcription quality for Indonesian speech.

Model Files

  • encoder_model_quantized.onnx - Quantized encoder model (313 MB)
  • decoder_model_quantized.onnx - Quantized decoder model (512 MB)
  • config.json - Model configuration
  • generation_config.json - Generation parameters
  • example.py - Usage example script

Performance Characteristics

  • Model Size: ~825 MB (vs. ~1 GB original)
  • Inference Speed: 20-40% faster than the original
  • Memory Usage: 15-30% lower than the original
  • Quality: minimal degradation in transcription accuracy

Installation

pip install -r requirements.txt

Usage

Basic Example

from example import CahyaWhisperONNX

# Initialize model
model = CahyaWhisperONNX("./")

# Transcribe audio file
transcription = model.transcribe("audio.wav")
print(transcription)

Command Line Usage

python example.py --audio path/to/audio.wav

Advanced Usage

import librosa
from example import CahyaWhisperONNX

# Initialize model
model = CahyaWhisperONNX("./")

# Load audio manually
audio, sr = librosa.load("audio.wav", sr=16000)

# Transcribe with custom parameters
transcription = model.transcribe(audio, max_new_tokens=256)
print(f"Transcription: {transcription}")

# Get model information
info = model.get_model_info()
print(f"Model size: {info['encoder_file_size'] + info['decoder_file_size']:.1f} MB")

Supported Audio Formats

  • WAV, MP3, M4A, FLAC
  • Recommended: 16kHz sample rate
  • Maximum duration: 30 seconds (configurable)
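Audio at other sample rates must be resampled to 16 kHz before transcription; `librosa.load(path, sr=16000)` does this automatically. As an illustration of what that step involves, here is a minimal linear-interpolation resampler in NumPy (the `resample_to_16k` helper is illustrative, not part of this repository; production code should prefer `librosa.resample`, which applies proper anti-aliasing):

```python
import numpy as np

TARGET_SR = 16000  # sample rate the model expects

def resample_to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Naive linear-interpolation resampling to 16 kHz (no anti-aliasing)."""
    if orig_sr == TARGET_SR:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * TARGET_SR))
    # Output sample positions expressed in input time, then interpolate
    t_out = np.linspace(0.0, duration, n_out, endpoint=False)
    t_in = np.arange(len(audio)) / orig_sr
    return np.interp(t_out, t_in, audio)

# Example: one second of 8 kHz audio becomes 16000 samples
tone = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
resampled = resample_to_16k(tone, 8000)
print(len(resampled))  # 16000
```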

Requirements

  • Python 3.8+
  • onnxruntime >= 1.16.0
  • transformers >= 4.35.0
  • librosa >= 0.10.0

Model Details

Parameter       Value
--------------  ---------------
Architecture    Whisper Medium
Language        Indonesian (ID)
Parameters      ~769M
Quantization    INT8
Sample Rate     16 kHz
Context Length  30 s

Benchmark Results

Performance comparison with original cahya/whisper-medium-id:

Metric          Original   ONNX Quantized   Change
--------------  ---------  ---------------  -------------------------------
Model Size      1024 MB    825 MB           19% smaller
Inference Time  2.34 s     1.86 s           21% faster
Memory Usage    45.2 MB    38.7 MB          14% lower
WER             0.045      0.048            ~6% higher (minimal degradation)

Benchmarked on CPU with typical Indonesian speech samples

Limitations

  1. Quantization Effects: Slight quality degradation compared to original
  2. Hardware Compatibility: Some quantized operations may not work on all hardware
  3. Language Support: Optimized specifically for Indonesian language
  4. Context Window: Limited to 30-second audio segments
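Because the context window is limited to 30-second segments, longer recordings need to be split before transcription. A hedged sketch of fixed-size chunking with a small overlap (the `chunk_audio` helper and the overlap value are illustrative, not part of this repository):

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_SECONDS = 30      # model context window
OVERLAP_SECONDS = 1     # small overlap to avoid cutting words in half

def chunk_audio(audio: np.ndarray,
                chunk_s: int = CHUNK_SECONDS,
                overlap_s: int = OVERLAP_SECONDS) -> list:
    """Split audio into <= 30 s windows with overlap between neighbors."""
    chunk = chunk_s * SAMPLE_RATE
    step = (chunk_s - overlap_s) * SAMPLE_RATE
    chunks = []
    for start in range(0, len(audio), step):
        chunks.append(audio[start:start + chunk])
        if start + chunk >= len(audio):
            break
    return chunks

# Example: a 70 s recording yields three overlapping chunks
audio = np.zeros(70 * SAMPLE_RATE, dtype=np.float32)
pieces = chunk_audio(audio)
print(len(pieces))  # 3
```

Each chunk would then be transcribed separately and the texts concatenated, optionally de-duplicating words in the overlap region.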

Troubleshooting

Common Issues

"Could not find an implementation for ConvInteger" Error

  • This indicates missing quantization operator support
  • Try updating onnxruntime: pip install -U onnxruntime
  • Consider using onnxruntime-gpu if available

Out of Memory Error

  • Reduce audio length to <30 seconds
  • Use CPU execution provider: modify providers=['CPUExecutionProvider']
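Forcing the CPU provider is done when creating the ONNX Runtime sessions. A sketch of provider selection (the `select_providers` and `load_session` helper names are illustrative; only the `providers` argument to `ort.InferenceSession` is the actual API):

```python
def select_providers(prefer_gpu: bool = False) -> list:
    """Build the execution-provider list; CPU is always the fallback."""
    providers = []
    if prefer_gpu:
        providers.append("CUDAExecutionProvider")  # requires onnxruntime-gpu
    providers.append("CPUExecutionProvider")
    return providers

def load_session(model_path: str, prefer_gpu: bool = False):
    """Create an InferenceSession with an explicit provider list."""
    import onnxruntime as ort  # imported lazily inside the helper
    return ort.InferenceSession(model_path, providers=select_providers(prefer_gpu))

print(select_providers())      # ['CPUExecutionProvider']
print(select_providers(True))  # ['CUDAExecutionProvider', 'CPUExecutionProvider']
```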

Poor Transcription Quality

  • Ensure audio is 16kHz sample rate
  • Check audio quality and volume
  • Try preprocessing audio (noise reduction, normalization)
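Volume normalization is the simplest of these preprocessing steps. A minimal peak-normalization sketch in NumPy (the `peak_normalize` helper is illustrative; noise reduction would need a dedicated library):

```python
import numpy as np

def peak_normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale audio so its loudest sample reaches `peak` (avoids clipping at 1.0)."""
    max_amp = np.max(np.abs(audio))
    if max_amp == 0:
        return audio  # silent input: nothing to scale
    return (audio * (peak / max_amp)).astype(np.float32)

quiet = np.array([0.01, -0.02, 0.015], dtype=np.float32)
loud = peak_normalize(quiet)
print(round(float(np.max(np.abs(loud))), 2))  # 0.95
```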

Performance Tips

  1. Faster Inference:

    • Use shorter audio clips
    • Reduce max_new_tokens parameter
    • Use GPU if available with onnxruntime-gpu
  2. Better Quality:

    • Preprocess audio (normalize volume, reduce noise)
    • Use high-quality audio sources
    • Ensure clear speech without background noise

Citation

@misc{cahya-whisper-medium-onnx,
  title={Cahya Whisper Medium ONNX},
  author={Indonesian Speech Recognition Community},
  year={2024},
  url={https://huggingface.co/asmud/cahya-whisper-medium-onnx}
}

License

Same license as the original Cahya Whisper model.
