
Quantum_STT_V2.0


Description:

Quantum_STT_V2.0 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/Quantamhash/Quantum_STT_V2.0

This XL variant of the FastConformer [1] architecture integrates the TDT [2] decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass.

Key Features

  • Accurate word-level timestamp predictions
  • Automatic punctuation and capitalization
  • Robust performance on spoken numbers and song lyrics transcription

This model is ready for commercial/non-commercial use.

License/Terms of Use:

GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license.

Deployment Geography:

Global

Use Case:

This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms.

Release Date:

14/05/2025

Model Architecture:

Architecture Type:

FastConformer-TDT

Network Architecture:

  • This model is based on the FastConformer encoder architecture [1] with a TDT decoder [2]
  • The model has 600 million parameters.

Input:

  • Input Type(s): 16 kHz audio
  • Input Format(s): .wav and .flac audio formats
  • Input Parameters: 1D (audio signal)
  • Other Properties Related to Input: Mono-channel audio (a preprocessing sketch follows this list)
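
If a recording is not already 16 kHz mono, it can be converted up front. A minimal sketch using the third-party librosa and soundfile libraries (the file names are placeholders):

import librosa
import soundfile as sf

# Load any audio file, resampling to 16 kHz and downmixing to mono,
# which matches the model's expected input format.
audio, sr = librosa.load("input_stereo_44k.wav", sr=16000, mono=True)

# Write a 16 kHz mono .wav file that can be passed to the model.
sf.write("input_16k_mono.wav", audio, sr)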

Output:

  • Output Type(s): Text
  • Output Format: String
  • Output Parameters: 1D (text)
  • Other Properties Related to Output: Punctuation and capitalization included.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

How to Use this Model:

To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend installing it after you've installed the latest PyTorch version.

pip install -U nemo_toolkit["asr"]

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically instantiate the model

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="Quantamhash/Quantum_STT_V2.0")

Transcribing using Python

First, let's get a sample:

wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav

Then simply do:

output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
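
transcribe() also accepts multiple files per call. A hedged sketch (the file names are placeholders; batch_size controls how many segments are decoded per forward pass):

outputs = asr_model.transcribe(['sample1.wav', 'sample2.flac'], batch_size=4)
for out in outputs:
    print(out.text)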

Transcribing with timestamps

To transcribe with timestamps:

output = asr_model.transcribe(['2086-149220-0033.wav'], timestamps=True)
# by default, timestamps are enabled for char, word and segment level
word_timestamps = output[0].timestamp['word'] # word level timestamps for first sample
segment_timestamps = output[0].timestamp['segment'] # segment level timestamps
char_timestamps = output[0].timestamp['char'] # char level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")

Software Integration:

Runtime Engine(s):

  • NeMo 2.2

Supported Operating System(s):

  • Linux

Hardware Specific Requirements:

At least 2 GB of RAM is required to load the model; more RAM allows longer audio inputs to be processed.

Model Version

Current version: Quantum_STT_V2.0. Previous versions can be accessed here.

Performance

Hugging Face Open ASR Leaderboard Performance

The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Given that this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio.
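
Concretely, WER is the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, divided by the number of reference words. A minimal sketch using the third-party jiwer library (the strings are illustrative):

import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / reference word count
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")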

Base Performance

The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

| Model | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli |
|-------|---------|-----|-------------|------------|---------------|---------------|-------------|------------|-----------|
| Quantum_STT_V2.0 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |

Noise Robustness

Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

| SNR Level | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | TEDLIUM | VoxPopuli | Relative Change |
|-----------|---------|-----|----------|------------|---------------|---------------|------|---------|-----------|-----------------|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |
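
For reference, noisy evaluation sets like these are typically built by scaling a noise clip to a target SNR and adding it to the speech. A rough NumPy sketch (assumes both arrays share the 16 kHz sample rate; file I/O omitted):

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Tile or truncate the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    # Choose a gain so that 10*log10(speech_power / (gain**2 * noise_power)) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise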

Telephony Audio Performance

Performance comparison between standard 16 kHz audio and telephony-style audio (μ-law encoding with a 16 kHz → 8 kHz → 16 kHz conversion):

| Audio Format | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | TEDLIUM | VoxPopuli | Relative Change |
|--------------|---------|-----|----------|------------|---------------|---------------|------|---------|-----------|-----------------|
| Standard 16 kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8 kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |

These WER scores were obtained using greedy decoding without an external language model.
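
The telephony condition can be reproduced offline. A hedged sketch using torchaudio (256 quantization channels correspond to standard 8-bit μ-law; the file names are placeholders):

import torchaudio
import torchaudio.functional as F

wav, sr = torchaudio.load("input_16k_mono.wav")  # expects 16 kHz mono input

# Downsample to telephone bandwidth, apply 8-bit mu-law companding,
# then restore 16 kHz so the clip matches the model's expected rate.
wav_8k = F.resample(wav, orig_freq=sr, new_freq=8000)
companded = F.mu_law_decoding(F.mu_law_encoding(wav_8k, 256), 256)
wav_16k = F.resample(companded, orig_freq=8000, new_freq=16000)

torchaudio.save("telephony_16k.wav", wav_16k, 16000)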
