
Quantum_STT_V2.0


Description:

Quantum_STT_V2.0 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/Quantamhash/Quantum_STT_V2.0

This XL variant of the FastConformer [1] architecture integrates the TDT [2] decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass.

Key Features

  • Accurate word-level timestamp predictions
  • Automatic punctuation and capitalization
  • Robust performance on spoken numbers and song lyrics transcription

This model is ready for commercial/non-commercial use.

License/Terms of Use:

GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license.

Deployment Geography:

Global

Use Case:

This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms.

Release Date:

14/05/2025

Model Architecture:

Architecture Type:

FastConformer-TDT

Network Architecture:

  • This model is based on the FastConformer encoder architecture [1] with a TDT decoder [2]
  • The model has 600 million parameters.

Input:

  • Input Type(s): 16 kHz audio
  • Input Format(s): .wav and .flac audio formats
  • Input Parameters: 1D (audio signal)
  • Other Properties Related to Input: Mono-channel audio (a preprocessing sketch follows this list)
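
If a recording is not already 16 kHz mono, it can be converted up front. A minimal sketch using the third-party librosa and soundfile libraries (the file names are placeholders):

import librosa
import soundfile as sf

# Load any audio file, resampling to 16 kHz and downmixing to mono,
# which matches the model's expected input format.
audio, sr = librosa.load("input_stereo_44k.wav", sr=16000, mono=True)

# Write a 16 kHz mono .wav file that can be passed to the model.
sf.write("input_16k_mono.wav", audio, sr)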

Output:

  • Output Type(s): Text
  • Output Format: String
  • Output Parameters: 1D (text)
  • Other Properties Related to Output: Punctuation and capitalization included.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

How to Use this Model:

To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend installing it after you've installed the latest PyTorch version.

pip install -U nemo_toolkit["asr"]

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically instantiate the model

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="Quantamhash/Quantum_STT_V2.0")

Transcribing using Python

First, let's get a sample:

wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav

Then simply do:

output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
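
transcribe() also accepts multiple files per call. A hedged sketch (the file names are placeholders; batch_size controls how many segments are decoded per forward pass):

outputs = asr_model.transcribe(['sample1.wav', 'sample2.flac'], batch_size=4)
for out in outputs:
    print(out.text)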

Transcribing with timestamps

To transcribe with timestamps:

output = asr_model.transcribe(['2086-149220-0033.wav'], timestamps=True)
# by default, timestamps are enabled for char, word and segment level
word_timestamps = output[0].timestamp['word'] # word level timestamps for first sample
segment_timestamps = output[0].timestamp['segment'] # segment level timestamps
char_timestamps = output[0].timestamp['char'] # char level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")

Software Integration:

Runtime Engine(s):

  • NeMo 2.2

Supported Operating System(s):

  • Linux

Hardware Specific Requirements:

At least 2 GB of RAM is required to load the model; more RAM allows longer audio inputs to be processed.

Model Version

Current version: Quantum_STT_V2.0. Previous versions can be accessed here.

Performance

Hugging Face Open ASR Leaderboard Performance

The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Given that this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio.
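
Concretely, WER is the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, divided by the number of reference words. A minimal sketch using the third-party jiwer library (the strings are illustrative):

import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / reference word count
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")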

Base Performance

The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

| Model | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli |
|-------|---------|-----|-------------|------------|---------------|---------------|-------------|------------|-----------|
| Quantum_STT_V2.0 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |

Noise Robustness

Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

| SNR Level | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | TEDLIUM | VoxPopuli | Relative Change |
|-----------|---------|-----|----------|------------|---------------|---------------|------|---------|-----------|-----------------|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |
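
For reference, noisy evaluation sets like these are typically built by scaling a noise clip to a target SNR and adding it to the speech. A rough NumPy sketch (assumes both arrays share the 16 kHz sample rate; file I/O omitted):

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Tile or truncate the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    # Choose a gain so that 10*log10(speech_power / (gain**2 * noise_power)) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise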

Telephony Audio Performance

Performance comparison between standard 16 kHz audio and telephony-style audio (μ-law encoding with a 16 kHz → 8 kHz → 16 kHz conversion):

| Audio Format | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | TEDLIUM | VoxPopuli | Relative Change |
|--------------|---------|-----|----------|------------|---------------|---------------|------|---------|-----------|-----------------|
| Standard 16 kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8 kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |

These WER scores were obtained using greedy decoding without an external language model.
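
The telephony condition can be reproduced offline. A hedged sketch using torchaudio (256 quantization channels correspond to standard 8-bit μ-law; the file names are placeholders):

import torchaudio
import torchaudio.functional as F

wav, sr = torchaudio.load("input_16k_mono.wav")  # expects 16 kHz mono input

# Downsample to telephone bandwidth, apply 8-bit mu-law companding,
# then restore 16 kHz so the clip matches the model's expected rate.
wav_8k = F.resample(wav, orig_freq=sr, new_freq=8000)
companded = F.mu_law_decoding(F.mu_law_encoding(wav_8k, 256), 256)
wav_16k = F.resample(companded, orig_freq=8000, new_freq=16000)

torchaudio.save("telephony_16k.wav", wav_16k, 16000)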
