---
license: apache-2.0
base_model:
- openai/whisper-large-v3
base_model_relation: quantized
pipeline_tag: automatic-speech-recognition
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- no
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
- yue
tags:
- audio
- automatic-speech-recognition
- speech-recognition
- whisper
- annthem
- qlip
- thestage
---

# Elastic model: Whisper Large v3. Fastest and most flexible models for self-hosting.

Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized versions:

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
* __L__: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.
* __M__: Faster model, with accuracy degradation of less than 1.5%.
* __S__: The fastest model, with accuracy degradation of less than 2%.

__Goals of elastic models:__

* Provide flexibility in the cost vs. quality trade-off for inference
* Provide clear quality and latency benchmarks for speech recognition
* Provide the interface of HF libraries (`transformers` and `elastic_models`) with a single line of code change for using optimized versions
* Provide models supported on a wide range of hardware (NVIDIA GPUs), pre-compiled and requiring no JIT
* Provide the best models and service for self-hosting

> It's important to note that we have consolidated all elastic model versions into a single optimized S model that provides the best balance of speed and quality for Whisper Large v3.

## Audio Examples

Below are examples demonstrating the transcription quality of the Elastic Whisper Large v3 S model compared to the original.

**Example audio transcriptions:**

| Audio Sample | Original Whisper Large v3 | Elastic S Model |
|---|---|---|
| | joel keaton disapproved of films and buster also had reservations about the medium | joel keaton disapproved of films and buster also had reservations about the medium |
| | she ll be alright | she ll be alright |
| | all is well that ends well | all is well that ends well |

## Inference

To run inference with our Whisper models, use the `elastic_models.transformers.WhisperForConditionalGeneration` class.
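The class mirrors the stock Hugging Face interface, so switching to the optimized model is essentially an import change plus the `mode` argument. For comparison, here is a minimal baseline sketch using plain `transformers` (standard Hugging Face classes; the audio path is a placeholder):

```python
import torch
import librosa
from transformers import AutoProcessor, WhisperForConditionalGeneration

model_name = "openai/whisper-large-v3"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stock Hugging Face model: same class name, different import than elastic_models
processor = AutoProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.float16
).to(device)
model.eval()

# Load and resample audio to Whisper's expected 16 kHz
audio, sr = librosa.load("path_to_your_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch.float16)

with torch.no_grad():
    predicted_ids = model.generate(input_features, max_new_tokens=100)

print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```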
**Example using `elastic_models` with the optimized model:**

```python
import torch
import librosa  # make sure this package is installed
from transformers import AutoProcessor
from transformers.pipelines import pipeline

from elastic_models.transformers import WhisperForConditionalGeneration

model_name = "openai/whisper-large-v3"
mode = "S"
audio_path = "path_to_your_audio.wav"
hf_token = "YOUR_TOKEN"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load processor and model
processor = AutoProcessor.from_pretrained(model_name, token=hf_token)
model = WhisperForConditionalGeneration.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.float16,
    mode=mode,
    device_map=device,
)
model.eval()

# Create an ASR pipeline around the optimized model
generator = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)

# Load audio at Whisper's expected 16 kHz sampling rate
audio, sr = librosa.load(audio_path, sr=16000)
print(f"Transcribing audio from: {audio_path}")

# Generate transcription using the pipeline
generate_kwargs = {
    "max_new_tokens": 100,
    "num_beams": 1,
}
result = generator(audio, generate_kwargs=generate_kwargs)
transcription = result["text"]
print(f"Transcription: {transcription}")
```

__System requirements:__

* GPUs: NVIDIA GeForce RTX 4090, NVIDIA GeForce RTX 5090, NVIDIA H100, NVIDIA L40S
* CPU: AMD, Intel
* Python: 3.8-3.12 (check dependencies for specific versions)

To work with our elastic models and compilation tools, install the `elastic_models` and `qlip` libraries from TheStage:

```shell
pip install thestage
pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install flash-attn==2.7.3 --no-build-isolation
pip install tensorrt==10.11.0.33  # for RTX 4090
pip uninstall apex
```

Or, for Blackwell (e.g. RTX 5090) support:

```shell
pip install 'thestage-elastic-models[blackwell]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Download the appropriate flash-attn wheel for your system from
# https://github.com/Zarrac/flashattention-blackwell-wheels-whl-ONLY-5090-5080-5070-5060-flash-attention-/releases/tag/FlashAttention
mv flash_attn-2.7.4.post1-rtx5090-torch2.7.0cu128cxx11abiTRUE-cp311-linux_x86_64.whl flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install tensorrt==10.11.0.33
pip uninstall apex
```

Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token on your profile page. Set up the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models and tools!

---

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for Whisper models accelerated with our algorithms.

### Quality benchmarks

Performance evaluation on standard speech recognition benchmarks:

| Metric/Model | S | Original |
|--------------|---|----------|
| WER (Common Voice) | 0.18 | 0.22 |

* **WER (Word Error Rate)**: the primary metric for evaluating speech recognition accuracy. Lower is better.
* **Common Voice**: a multilingual speech recognition benchmark covering diverse languages and accents.
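To sanity-check WER on your own data, any standard implementation works; below is a minimal sketch using the third-party `jiwer` package (`pip install jiwer`), with placeholder reference and hypothesis strings:

```python
import jiwer

# Placeholder strings: reference transcript vs. model output
reference = "joel keaton disapproved of films and buster also had reservations about the medium"
hypothesis = "joel keaton disapproved of films and buster had reservations about the medium"

# WER = (substitutions + deletions + insertions) / number of reference words
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```

As in the examples above, transcripts are typically normalized (lowercased, punctuation stripped) before scoring.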
### Latency benchmarks

Throughput for transcribing audio, measured in tokens per second (tps). A minimal timing sketch for reproducing such measurements is included at the end of this card.

**Batch Size 1:**

| GPU Type | S | Original |
|----------|---|----------|
| H100 | 223.47 | 82.84 |
| L40S | 210.67 | 72.36 |
| GeForce RTX 4090 | 240.00 | 86.63 |
| GeForce RTX 5090 | 265.93 | 195.76 |

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X (Twitter)](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai
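As referenced in the latency benchmarks section, here is a rough sketch for estimating tokens per second on your own hardware. It reuses the `elastic_models` setup from the inference example; the warm-up and timing loop are illustrative, not our official benchmarking harness:

```python
import time

import librosa
import torch
from transformers import AutoProcessor

from elastic_models.transformers import WhisperForConditionalGeneration

model_name = "openai/whisper-large-v3"
device = torch.device("cuda")

processor = AutoProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.float16, mode="S", device_map=device
)
model.eval()

# Prepare 16 kHz input features, as in the inference example
audio, sr = librosa.load("path_to_your_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch.float16)

# Warm-up run so one-time initialization does not skew the measurement
model.generate(input_features, max_new_tokens=100)

torch.cuda.synchronize()
start = time.perf_counter()
generated_ids = model.generate(input_features, max_new_tokens=100)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Rough throughput at batch size 1: generated tokens per second
print(f"{generated_ids.shape[-1] / elapsed:.2f} tps")
```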