---
license: cc-by-4.0
language:
- en
pipeline_tag: automatic-speech-recognition
library_name: nemo
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- NeMo
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: Quantum_STT_V2.0
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: AMI (Meetings test)
type: edinburghcstr/ami
config: ihm
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 11.16
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Earnings-22
type: revdotcom/earnings22
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 11.15
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: GigaSpeech
type: speechcolab/gigaspeech
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 9.74
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: LibriSpeech (clean)
type: librispeech_asr
config: clean
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 1.69
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: LibriSpeech (other)
type: librispeech_asr
config: other
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 3.19
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: SPGI Speech
type: kensho/spgispeech
config: test
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 2.17
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: tedlium-v3
type: LIUM/tedlium
config: release1
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 3.38
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Vox Populi
type: facebook/voxpopuli
config: en
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 5.95
metrics:
- wer
base_model:
- Quantamhash/Quantum_STT
---

Quantum_STT_V2.0

Description:
Quantum_STT_V2.0 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/Quantamhash/Quantum_STT_V2.0
This XL variant of the FastConformer [1] architecture integrates the TDT [2] decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass.
Key Features
- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song lyrics transcription
This model is ready for commercial/non-commercial use.
License/Terms of Use:
GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license.
Deployment Geography:
Global
Use Case:
This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms.
Release Date:
May 14, 2025
Model Architecture:
Architecture Type:
FastConformer-TDT
Network Architecture:
- This model is based on the FastConformer encoder architecture [1] with a TDT decoder [2].
- It has 600 million parameters.
Input:
- Input Type(s): 16 kHz audio
- Input Format(s): .wav and .flac audio formats
- Input Parameters: 1D (audio signal)
- Other Properties Related to Input: Monochannel audio
Output:
- Output Type(s): Text
- Output Format: String
- Output Parameters: 1D (text)
- Other Properties Related to Output: Punctuation and capitalization included.
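
If your source audio is not already 16 kHz mono, convert it before transcription. Below is a minimal sketch using the third-party librosa and soundfile packages (an assumption of this example, not a requirement of the model; file names are placeholders):
```python
import librosa
import soundfile as sf

# Load any common audio format, downmixing to mono and resampling to
# 16 kHz to match the model's expected input.
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)

# Write a 16-bit PCM .wav file that can be passed to the model directly.
sf.write("input_audio_16k.wav", audio, 16000, subtype="PCM_16")
```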
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
How to Use this Model:
To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend installing it after you've installed the latest version of PyTorch:
```bash
pip install -U nemo_toolkit["asr"]
```
The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically instantiate the model:
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="Quantamhash/Quantum_STT_V2.0")
```
Transcribing using Python
First, let's get a sample:
```bash
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```
Then simply do:
```python
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
```
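
To transcribe several files at once, pass a list of paths; the batch_size argument of NeMo's transcribe API controls how many inputs are processed per forward pass. A short sketch (file names are placeholders):
```python
# Batch transcription: a larger batch_size trades GPU memory for throughput.
outputs = asr_model.transcribe(
    ["audio1.wav", "audio2.wav", "audio3.wav"],
    batch_size=4,
)
for out in outputs:
    print(out.text)
```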
Transcribing with timestamps
To transcribe with timestamps:
```python
output = asr_model.transcribe(['2086-149220-0033.wav'], timestamps=True)

# By default, timestamps are enabled at the char, word, and segment level.
word_timestamps = output[0].timestamp['word']        # word-level timestamps for the first sample
segment_timestamps = output[0].timestamp['segment']  # segment-level timestamps
char_timestamps = output[0].timestamp['char']        # char-level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")
```
Software Integration:
Runtime Engine(s):
- NeMo 2.2
[Preferred/Supported] Operating System(s):
- Linux
Hardware Specific Requirements:
At least 2 GB of RAM is required to load the model; more RAM allows longer audio inputs to be processed.
Model Version
Current version: Quantum_STT_V2.0. Previous versions can be accessed here.
Performance
Hugging Face Open ASR Leaderboard Performance
The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Given that this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio.
Base Performance
The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):
| Model | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli |
|---|---|---|---|---|---|---|---|---|---|
| Quantum_STT_V2.0 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |
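
For reference, WER counts the word-level substitutions, deletions, and insertions needed to turn the model's hypothesis into the reference transcript, divided by the number of reference words. A minimal sketch using the third-party jiwer package (our own illustration; the scores above follow the Open ASR Leaderboard evaluation, not this snippet):
```python
import jiwer  # pip install jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / reference word count
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # 2 substitutions / 9 words = 22.22%
```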
Noise Robustness
Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:
| SNR Level | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|---|---|---|---|---|---|---|---|---|---|---|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |
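
The SNR conditions above can be reproduced by scaling a noise sample relative to the speech power before mixing. A minimal numpy sketch of that mixing step (our own illustration; the exact MUSAN evaluation pipeline is not specified here):
```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at a target signal-to-noise ratio in dB."""
    # Tile or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Target noise power: P_noise = P_speech / 10^(SNR/10)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```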
Telephony Audio Performance
Performance comparison between standard 16kHz audio and telephony-style audio (using μ-law encoding with 16kHz→8kHz→16kHz conversion):
| Audio Format | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |
These WER scores were obtained using greedy decoding without an external language model.
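
The telephony condition described above can be approximated by downsampling to 8 kHz, applying a μ-law companding round trip, and upsampling back to 16 kHz. A minimal scipy/numpy sketch (our own approximation, assuming input samples normalized to [-1, 1]):
```python
import numpy as np
from scipy.signal import resample_poly

def simulate_telephony(audio_16k: np.ndarray, mu: float = 255.0) -> np.ndarray:
    """Approximate a telephony channel: 16 kHz -> 8 kHz -> mu-law -> 16 kHz."""
    audio_8k = resample_poly(audio_16k, up=1, down=2)  # downsample to 8 kHz
    # mu-law companding (G.711-style), quantized to 8 bits.
    compressed = np.sign(audio_8k) * np.log1p(mu * np.abs(audio_8k)) / np.log1p(mu)
    quantized = np.round(compressed * 127) / 127
    expanded = np.sign(quantized) * ((1 + mu) ** np.abs(quantized) - 1) / mu
    return resample_poly(expanded, up=2, down=1)  # upsample back to 16 kHz
```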