Whisper
Collection
Whisper models for automatic speech recognition (ASR) and speech translation, quantized for faster inference speeds.
•
18 items
•
Updated
•
2
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.
This int8 ONNX model is generated by neural-compressor and the fp32 model can be exported with below command:
optimum-cli export onnx --model openai/whisper-large-v2 whisper-large-v2-with-past/ --task automatic-speech-recognition-with-past --opset 13
Model Detail | Description |
---|---|
Model Authors - Company | Intel |
Date | September 1, 2023 |
Version | 1 |
Type | Speech Recognition |
Paper or Other Resources | - |
License | Apache 2.0 |
Questions or Comments | Community Tab |
Intended Use | Description |
---|---|
Primary intended uses | You can use the raw model for automatic speech recognition inference |
Primary intended users | Anyone doing automatic speech recognition inference |
Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |
Download the model by cloning the repository:
git clone https://huggingface.co/Intel/whisper-large-v2-int8-dynamic
Evaluate the model with below code:
import os
from evaluate import load
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor, AutoConfig
model_name = 'openai/whisper-large-v2'
model_path = 'whisper-large-v2-int8-dynamic'
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig
model_config = PretrainedConfig.from_pretrained(model_name)
predictions = []
references = []
sessions = ORTModelForSpeechSeq2Seq.load_model(
os.path.join(model_path, 'encoder_model.onnx'),
os.path.join(model_path, 'decoder_model.onnx'),
os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])
for idx, batch in enumerate(librispeech_test_clean):
audio = batch["audio"]
input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
reference = processor.tokenizer._normalize(batch['text'])
references.append(reference)
predicted_ids = model.generate(input_features)[0]
transcription = processor.decode(predicted_ids)
prediction = processor.tokenizer._normalize(transcription)
predictions.append(prediction)
wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
accuracy = 1 - wer_result
print("Accuracy: %.5f" % accuracy)
Model | Model Size (GB) | wer |
---|---|---|
FP32 | 13 | 2.87 |
INT8 | 2.4 | 2.82 |