+---
+license: apache-2.0
+base_model:
+- openai/whisper-large-v3
+base_model_relation: quantized
+pipeline_tag: automatic-speech-recognition
+language:
+- en
+- zh
+- de
+- es
+- ru
+- ko
+- fr
+- ja
+- pt
+- tr
+- pl
+- ca
+- nl
+- ar
+- sv
+- it
+- id
+- hi
+- fi
+- vi
+- he
+- uk
+- el
+- ms
+- cs
+- ro
+- da
+- hu
+- ta
+- no
+- th
+- ur
+- hr
+- bg
+- lt
+- la
+- mi
+- ml
+- cy
+- sk
+- te
+- fa
+- lv
+- bn
+- sr
+- az
+- sl
+- kn
+- et
+- mk
+- br
+- eu
+- is
+- hy
+- ne
+- mn
+- bs
+- kk
+- sq
+- sw
+- gl
+- mr
+- pa
+- si
+- km
+- sn
+- yo
+- so
+- af
+- oc
+- ka
+- be
+- tg
+- sd
+- gu
+- am
+- yi
+- lo
+- uz
+- fo
+- ht
+- ps
+- tk
+- nn
+- mt
+- sa
+- lb
+- my
+- bo
+- tl
+- mg
+- as
+- tt
+- haw
+- ln
+- ha
+- ba
+- jw
+- su
+- yue
+tags:
+- audio
+- automatic-speech-recognition
+- speech-recognition
+- whisper
+- annthem
+- qlip
+- thestage
+---
+# Elastic model: Whisper Large v3. Fastest and most flexible models for self-hosting.
+Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you trade off model size, latency, and quality by moving a single slider. For each model, ANNA produces a series of optimized variants:
+* __S__: The fastest model with optimized performance and minimal quality degradation, offering the best speed-accuracy tradeoff for production deployments.
+__Goals of elastic models:__
+* Provide flexibility in cost vs quality selection for inference
+* Provide clear quality and latency benchmarks for speech recognition
+* Provide an interface that mirrors the HF `transformers` library through `elastic_models`, so switching to an optimized version takes a single-line code change
+* Provide models supported on a wide range of hardware (NVIDIA GPUs), which are pre-compiled and require no JIT
+* Provide the best models and service for self-hosting
+> Note: all elastic model versions have been consolidated into a single optimized S model, which provides the best balance of speed and quality for Whisper Large v3.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/V8hpZ-cA9vE5Ijyodp6Ih.png)
+## Audio Examples
+Below are examples demonstrating the transcription quality of the Elastic Whisper Large v3 S model compared to the original.
+**Example Audio Transcriptions:**
+| Audio Sample | Original Whisper Large v3 | Elastic S Model |
+|-------------|---------------------------|-----------------|
+| Sample 1    | [Transcription placeholder] | [Transcription placeholder] |
+| Sample 2    | [Transcription placeholder] | [Transcription placeholder] |
+| Sample 3    | [Transcription placeholder] | [Transcription placeholder] |
+-----
+## Inference
+To run inference with our Whisper models, use the `elastic_models.transformers.WhisperForConditionalGeneration` class.
+**Example using `elastic_models` with the optimized model:**
+```python
+import torch
+import librosa
+from transformers import AutoProcessor
+from elastic_models.transformers import WhisperForConditionalGeneration
+model_name = "openai/whisper-large-v3"
+mode = "S"
+audio_path = "path_to_your_audio.wav"
+hf_token = "YOUR_TOKEN"
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# Load processor and model
+processor = AutoProcessor.from_pretrained(model_name, token=hf_token)
+model = WhisperForConditionalGeneration.from_pretrained(
+    model_name,
+    token=hf_token,
+    torch_dtype=torch.float16,
+    mode=mode,
+    device_map=device,
+)
+model.eval()
+# Load and process audio
+audio, sr = librosa.load(audio_path, sr=16000)
+inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
+# Cast the float features to float16 so they match the model's dtype
+inputs = inputs.to(device, dtype=torch.float16)
+print(f"Transcribing audio from: {audio_path}")
+generate_kwargs = {"max_new_tokens": 100, "num_beams": 1}
+# Generate transcription
+with torch.inference_mode():
+    generate_ids = model.generate(**inputs, **generate_kwargs)
+# Decode the transcription
+transcription = processor.batch_decode(
+    generate_ids,
+    skip_special_tokens=True,
+    clean_up_tokenization_spaces=False
+)[0]
+print(f"Transcription: {transcription}")
+```
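
The Whisper feature extractor pads or truncates each input to a 30-second window, so the example above transcribes at most the first 30 seconds of a recording. For longer files, one simple option is to split the waveform into fixed 30-second chunks and transcribe each chunk in turn. The helper below is a minimal sketch of that idea, not part of `elastic_models`; the name `split_into_chunks` and its parameters are hypothetical, and a naive fixed-length split can cut a word at a chunk boundary.

```python
SAMPLE_RATE = 16_000   # Whisper expects 16 kHz audio
CHUNK_SECONDS = 30     # length of Whisper's input window


def split_into_chunks(audio, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a 1-D waveform into consecutive chunks of at most `chunk_seconds`.

    `audio` can be any sliceable sequence (a NumPy array from librosa or a list).
    The last chunk may be shorter than the others.
    """
    chunk_len = sample_rate * chunk_seconds
    if chunk_len <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    return [audio[start:start + chunk_len] for start in range(0, len(audio), chunk_len)]
```

Each chunk can then be passed through `processor(...)` and `model.generate(...)` exactly as in the example above, and the decoded strings joined with a space. Longer chunks of speech may also need a larger `max_new_tokens` than 100.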
+__System requirements:__
+* GPUs: NVIDIA GeForce RTX 4090, GeForce RTX 5090, L40S
+* CPU: AMD, Intel
+* Python: 3.8-3.12 (check dependencies for specific versions)
+To work with our elastic models and compilation tools, you need to install the `elastic_models` and `qlip` libraries from TheStage:
+```shell
+pip install thestage
+pip install 'thestage-elastic-models[nvidia]'
+pip install flash-attn==2.7.3 --no-build-isolation
+pip uninstall apex
+```
+Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token from your profile page. Set up the API token as follows:
+```shell
+thestage config set --api-token <YOUR_API_TOKEN>
+```
+Congrats, now you can use accelerated models and tools!
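
Before loading a model, it can help to confirm that the environment matches the requirements listed above. The snippet below is a small sanity check under stated assumptions: the helper names `python_supported` and `check_environment` are hypothetical, the module list is taken from the imports in the inference example, and the check only verifies that each module can be found, not that its version is compatible.

```python
import importlib.util
import sys

# Supported range from the system requirements above: Python 3.8-3.12
SUPPORTED_PYTHON = ((3, 8), (3, 12))


def python_supported(version):
    """Return True if (major, minor) falls inside the supported range."""
    low, high = SUPPORTED_PYTHON
    return low <= tuple(version[:2]) <= high


def check_environment(modules=("torch", "transformers", "librosa", "elastic_models")):
    """Report whether the interpreter is supported and each module is importable."""
    report = {"python": python_supported(sys.version_info)}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'MISSING'}")
```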
+----
+## Benchmarks
+Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for Whisper models using our algorithms.
+### Quality benchmarks
+Performance evaluation on standard speech recognition benchmarks:
+| Metric/Model | S | Original |
+|--------------|---|----------|
+| WER (Common Voice) | [TBD] | [TBD] |
+* **WER (Word Error Rate)**: The primary metric for evaluating speech recognition accuracy. Lower is better.
+* **Common Voice**: Multilingual speech recognition benchmark covering diverse languages and accents.
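
For reference, WER is the word-level edit distance between the reference and the hypothesis (substitutions + deletions + insertions) divided by the number of reference words. The function below is a minimal sketch of that definition, not the evaluation script used for the table above. It assumes whitespace tokenization and does no text normalization (casing, punctuation), which real evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```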
+### Latency benchmarks (ms)
+Latency for transcribing audio, in milliseconds:
+**Batch Size 1:**
+| GPU Type | S | Original |
+|----------|---|----------|
+| GeForce RTX 4090 | [TBD] | [TBD] |
+| GeForce RTX 5090 | [TBD] | [TBD] |
+| L40S | [TBD] | [TBD] |
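
To reproduce latency measurements on your own hardware, a common pattern is to run a few warmup iterations and then time repeated calls. The helper below is a sketch of that pattern and not necessarily the procedure behind the table above; the name `measure_latency_ms` and its parameters are hypothetical. Because CUDA kernels run asynchronously, a synchronization callback should be passed when timing GPU work.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, iters=10, sync=None):
    """Time `fn()` and return latency statistics in milliseconds.

    `sync` is an optional callable that blocks until pending device work
    has finished (for example `torch.cuda.synchronize` on NVIDIA GPUs).
    """
    # Warmup runs are excluded from the timings
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(iters):
        if sync is not None:
            sync()  # make sure earlier work is not counted in this iteration
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() to finish
        timings.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(timings),
        "median_ms": statistics.median(timings),
        "min_ms": min(timings),
    }
```

With the objects from the inference example, a call could look like `measure_latency_ms(lambda: model.generate(**inputs, **generate_kwargs), sync=torch.cuda.synchronize)`.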
+## Links
+* __Platform__: [app.thestage.ai](https://app.thestage.ai)
+* __Subscribe for updates__: [TheStageAI X (Twitter)](https://x.com/TheStageAI)
+* __Contact email__: [email protected]