---
license: cc-by-4.0
language:
  - en
pipeline_tag: automatic-speech-recognition
library_name: nemo
thumbnail: null
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - Transducer
  - TDT
  - FastConformer
  - Conformer
  - pytorch
  - NeMo
  - hf-asr-leaderboard
widget:
  - example_title: Librispeech sample 1
    src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
  - example_title: Librispeech sample 2
    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
  - name: Quantum_STT_V2.0
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: AMI (Meetings test)
          type: edinburghcstr/ami
          config: ihm
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 11.16
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Earnings-22
          type: revdotcom/earnings22
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 11.15
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: GigaSpeech
          type: speechcolab/gigaspeech
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 9.74
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: LibriSpeech (clean)
          type: librispeech_asr
          config: clean
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 1.69
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: LibriSpeech (other)
          type: librispeech_asr
          config: other
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 3.19
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: SPGI Speech
          type: kensho/spgispeech
          config: test
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 2.17
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: tedlium-v3
          type: LIUM/tedlium
          config: release1
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 3.38
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Vox Populi
          type: facebook/voxpopuli
          config: en
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 5.95
metrics:
  - wer
base_model:
  - Quantamhash/Quantum_STT
---

# Quantum_STT_V2.0


## Description:

Quantum_STT_V2.0 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/Quantamhash/Quantum_STT_V2.0

This XL variant of the FastConformer [1] architecture integrates the TDT [2] decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass.

## Key Features

- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song-lyrics transcription

This model is ready for commercial/non-commercial use.

## License/Terms of Use:

GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license.

## Deployment Geography:

Global

## Use Case:

This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms.

## Release Date:

14/05/2025

## Model Architecture:

### Architecture Type:

FastConformer-TDT

### Network Architecture:

- This model was developed based on the FastConformer encoder architecture [1] and the TDT decoder [2]
- It has 600 million parameters.

## Input:

- Input Type(s): 16kHz Audio
- Input Format(s): .wav and .flac audio formats
- Input Parameters: 1D (audio signal)
- Other Properties Related to Input: Monochannel audio (see the conversion sketch below)
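Since the model expects 16 kHz mono input, audio with other sample rates or channel layouts should be converted first. Below is a minimal sketch using librosa and soundfile (both assumed to be installed; `input.mp3` is a placeholder file name):

```python
# Minimal sketch: convert arbitrary audio to 16 kHz mono WAV for the model.
# Assumes librosa and soundfile are installed; "input.mp3" is a placeholder.
import librosa
import soundfile as sf

audio, sr = librosa.load("input.mp3", sr=16000, mono=True)  # resample and downmix
sf.write("input_16k_mono.wav", audio, sr)                   # 16 kHz mono .wav
```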

## Output:

- Output Type(s): Text
- Output Format: String
- Output Parameters: 1D (text)
- Other Properties Related to Output: Punctuation and capitalization included.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## How to Use this Model:

To train, fine-tune, or experiment with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.

```bash
pip install -U nemo_toolkit["asr"]
```

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="Quantamhash/Quantum_STT_V2.0")
```

### Transcribing using Python

First, let's get a sample:

```bash
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```

Then simply do:

```python
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
```
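`transcribe` also accepts several files in one call. A small batched sketch (file names are placeholders; the `batch_size` argument controls how many files are decoded at once in recent NeMo versions, so treat this as a sketch for your installed version):

```python
# Sketch: batched transcription of multiple files (placeholder file names).
outputs = asr_model.transcribe(['audio1.wav', 'audio2.wav'], batch_size=2)
for out in outputs:
    print(out.text)
```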

### Transcribing with timestamps

To transcribe with timestamps:

```python
output = asr_model.transcribe(['2086-149220-0033.wav'], timestamps=True)

# by default, timestamps are enabled for char, word and segment level
word_timestamps = output[0].timestamp['word']        # word-level timestamps for first sample
segment_timestamps = output[0].timestamp['segment']  # segment-level timestamps
char_timestamps = output[0].timestamp['char']        # char-level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")
```

## Software Integration:

### Runtime Engine(s):

- NeMo 2.2

### Supported Operating System(s):

- Linux

### Hardware Specific Requirements:

At least 2 GB of RAM is required to load the model. More available RAM allows longer audio inputs to be transcribed.

## Model Version

Current version: Quantum_STT_V2.0. Previous versions can be accessed here.

## Performance

### Huggingface Open-ASR-Leaderboard Performance

The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Given that this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio.
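For reference, WER is the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference transcript, divided by the number of reference words. A quick, illustrative way to compute the metric yourself is the jiwer package (the example strings below are made up, not from the evaluation sets):

```python
# Illustrative WER computation with jiwer (pip install jiwer).
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / number of reference words
print(jiwer.wer(reference, hypothesis))
```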

### Base Performance

The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

| Model | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Quantum_STT_V2.0 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |

### Noise Robustness

Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

| SNR Level | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |
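For context, mixing noise into speech at a target SNR generally means scaling the noise so that 10·log10(P_speech / P_noise) equals the desired level. The NumPy sketch below illustrates the idea; it is not the exact evaluation code, and `speech`/`noise` are placeholder arrays of equal length:

```python
# Sketch: mix noise into speech at a target SNR (dB).
# `speech` and `noise` are placeholder float arrays of equal length.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale noise so that 10*log10(p_speech / p_noise_scaled) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```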

### Telephony Audio Performance

Performance comparison between standard 16kHz audio and telephony-style audio (using μ-law encoding with 16kHz→8kHz→16kHz conversion):

| Audio Format | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |

These WER scores were obtained using greedy decoding without an external language model.
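The telephony-style condition above (μ-law companding with a 16kHz→8kHz→16kHz round trip) can be approximated with torchaudio as sketched below. This illustrates the degradation pipeline, not the exact evaluation script; `speech.wav` is a placeholder 16 kHz mono file:

```python
# Sketch: approximate telephony-style audio (mu-law, 16 kHz -> 8 kHz -> 16 kHz).
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("speech.wav")  # expected: 16 kHz mono, sr == 16000
narrowband = F.resample(waveform, orig_freq=sr, new_freq=8000)
# 8-bit mu-law encode/decode round trip (256 quantization channels)
companded = F.mu_law_decoding(F.mu_law_encoding(narrowband, 256), 256)
telephony = F.resample(companded, orig_freq=8000, new_freq=16000)
torchaudio.save("speech_telephony.wav", telephony, 16000)
```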