---
license: cc-by-4.0
language:
- en
pipeline_tag: automatic-speech-recognition
library_name: nemo
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- NeMo
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: Quantum_STT_V2.0
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: AMI (Meetings test)
      type: edinburghcstr/ami
      config: ihm
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 11.16
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Earnings-22
      type: revdotcom/earnings22
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 11.15
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: GigaSpeech
      type: speechcolab/gigaspeech
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.74
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.69
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.19
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: SPGI Speech
      type: kensho/spgispeech
      config: test
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.17
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.38
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 5.95
metrics:
- wer
base_model:
- Quantamhash/Quantum_STT
---
# **Quantum_STT_V2.0**

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--TDT-blue#model-badge)](#model-architecture) | [![Model size](https://img.shields.io/badge/Params-0.6B-green#model-badge)](#model-architecture) | [![Language](https://img.shields.io/badge/Language-en-orange#model-badge)](#datasets)

## Description:

`Quantum_STT_V2.0` is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction.

Try the demo here: https://huggingface.co/spaces/Quantamhash/Quantum_STT_V2.0

This XL variant of the FastConformer [1] architecture integrates the TDT [2] decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass.

**Key Features**
- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song lyrics transcription

This model is ready for commercial/non-commercial use.

## License/Terms of Use:

GOVERNING TERMS: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license.

### Deployment Geography:

Global

### Use Case:

This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms.

### Release Date:

14/05/2025

### Model Architecture:

**Architecture Type**: FastConformer-TDT

**Network Architecture**:
* This model was developed based on the [FastConformer encoder](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) architecture [1] and the TDT decoder [2].
* This model has 600 million parameters.

### Input:
- **Input Type(s):** 16kHz Audio
- **Input Format(s):** `.wav` and `.flac` audio formats
- **Input Parameters:** 1D (audio signal)
- **Other Properties Related to Input:** Monochannel audio

### Output:
- **Output Type(s):** Text
- **Output Format:** String
- **Output Parameters:** 1D (text)
- **Other Properties Related to Output:** Punctuation and capitalization included.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## How to Use this Model:

To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.

```bash
pip install -U nemo_toolkit["asr"]
```

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
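If you want to confirm the installation before downloading the checkpoint, a minimal sanity check such as the one below should suffice. This is only a sketch; the exact version string printed will depend on your environment.

```python
# Quick sanity check that NeMo and its ASR collection are importable.
import nemo
import nemo.collections.asr as nemo_asr

print("NeMo version:", nemo.__version__)
```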
#### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="Quantamhash/Quantum_STT_V2.0")
```

#### Transcribing using Python

First, let's get a sample:

```bash
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```

Then simply do:

```python
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
```

#### Transcribing with timestamps

To transcribe with timestamps:

```python
output = asr_model.transcribe(['2086-149220-0033.wav'], timestamps=True)
# by default, timestamps are enabled for char, word and segment level
word_timestamps = output[0].timestamp['word']  # word-level timestamps for the first sample
segment_timestamps = output[0].timestamp['segment']  # segment-level timestamps
char_timestamps = output[0].timestamp['char']  # char-level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")
```
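As noted in the Input section above, the model expects 16kHz, single-channel audio. If your recordings use a different sample rate or have multiple channels, converting them first is a simple way to match that format. The sketch below reuses the `asr_model` loaded earlier and relies on the third-party `librosa` and `soundfile` packages (not part of NeMo, installed separately); `my_recording.mp3` is a placeholder file name.

```python
# Convert an arbitrary audio file to 16 kHz mono WAV before transcription.
import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono.
audio, sr = librosa.load("my_recording.mp3", sr=16000, mono=True)
sf.write("my_recording_16k.wav", audio, 16000)

output = asr_model.transcribe(["my_recording_16k.wav"])
print(output[0].text)
```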
## Software Integration:

**Runtime Engine(s):**
* NeMo 2.2

**[Preferred/Supported] Operating System(s):**
- Linux

**Hardware Specific Requirements:**
At least 2GB of RAM is required to load the model; the more RAM available, the longer the audio inputs that can be processed.

#### Model Version

Current version: Quantum_STT_V2.0. Previous versions can be accessed [here](https://huggingface.co/Quantamhash/Quantum_STT).

## Performance

### Hugging Face Open ASR Leaderboard Performance

The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Because this model is trained on a large and diverse dataset spanning multiple domains, it is generally robust and accurate across various types of audio.

### Base Performance

The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

| **Model** | **Avg WER** | **AMI** | **Earnings-22** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI Speech** | **TEDLIUM-v3** | **VoxPopuli** |
|:-------------|:-------------:|:---------:|:------------------:|:----------------:|:-----------------:|:-----------------:|:------------------:|:----------------:|:---------------:|
| Quantum_STT_V2.0 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |

### Noise Robustness

Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

| **SNR Level** | **Avg WER** | **AMI** | **Earnings-22** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI Speech** | **TEDLIUM-v3** | **VoxPopuli** | **Relative Change** |
|:---------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |

### Telephony Audio Performance

Performance comparison between standard 16kHz audio and telephony-style audio (using μ-law encoding with 16kHz→8kHz→16kHz conversion):

| **Audio Format** | **Avg WER** | **AMI** | **Earnings-22** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI Speech** | **TEDLIUM-v3** | **VoxPopuli** | **Relative Change** |
|:-----------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
| Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |

These WER scores were obtained using greedy decoding without an external language model.
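To score your own transcripts against reference text in the same metric, WER can be computed with the third-party `jiwer` package (not part of NeMo). A minimal sketch with made-up example strings:

```python
# Compute word error rate between reference transcripts and model hypotheses.
# The strings below are illustrative only; use your own references and model outputs.
import jiwer

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumped over the lazy dog"]

wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer * 100:.2f}%")
```

Note that leaderboard-style evaluation typically normalizes text (for example, lowercasing and removing punctuation) before scoring, so raw punctuated transcripts may yield different numbers.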