---
license: cc-by-4.0
language:
  - en
pipeline_tag: automatic-speech-recognition
library_name: nemo
thumbnail: null
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - Transducer
  - TDT
  - FastConformer
  - Conformer
  - pytorch
  - NeMo
  - hf-asr-leaderboard
widget:
  - example_title: Librispeech sample 1
    src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
  - example_title: Librispeech sample 2
    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
  - name: Quantum_STT_V2.0
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: AMI (Meetings test)
          type: edinburghcstr/ami
          config: ihm
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 11.16
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Earnings-22
          type: revdotcom/earnings22
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 11.15
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: GigaSpeech
          type: speechcolab/gigaspeech
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 9.74
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: LibriSpeech (clean)
          type: librispeech_asr
          config: clean
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 1.69
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: LibriSpeech (other)
          type: librispeech_asr
          config: other
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 3.19
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: SPGI Speech
          type: kensho/spgispeech
          config: test
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 2.17
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: tedlium-v3
          type: LIUM/tedlium
          config: release1
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 3.38
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Vox Populi
          type: facebook/voxpopuli
          config: en
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 5.95
metrics:
  - wer
base_model:
  - Quantamhash/Quantum_STT
---

# Quantum_STT_V2.0


## Description:

Quantum_STT_V2.0 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/Quantamhash/Quantum_STT_V2.0

This XL variant of the FastConformer [1] architecture integrates the TDT [2] decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass.

## Key Features

- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song-lyrics transcription

This model is ready for commercial/non-commercial use.

## License/Terms of Use:

GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license.

## Deployment Geography:

Global

## Use Case:

This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms.

## Release Date:

14/05/2025

## Model Architecture:

### Architecture Type:

FastConformer-TDT

### Network Architecture:

- This model was developed based on the FastConformer encoder architecture [1] and the TDT decoder [2]
- It has 600 million parameters.

## Input:

- Input Type(s): 16kHz Audio
- Input Format(s): .wav and .flac audio formats
- Input Parameters: 1D (audio signal)
- Other Properties Related to Input: Monochannel audio (see the conversion sketch below)
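Since the model expects 16 kHz mono input, audio with other sample rates or channel layouts should be converted first. Below is a minimal sketch using librosa and soundfile (both assumed to be installed; `input.mp3` is a placeholder file name):

```python
# Minimal sketch: convert arbitrary audio to 16 kHz mono WAV for the model.
# Assumes librosa and soundfile are installed; "input.mp3" is a placeholder.
import librosa
import soundfile as sf

audio, sr = librosa.load("input.mp3", sr=16000, mono=True)  # resample and downmix
sf.write("input_16k_mono.wav", audio, sr)                   # 16 kHz mono .wav
```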

## Output:

- Output Type(s): Text
- Output Format: String
- Output Parameters: 1D (text)
- Other Properties Related to Output: Punctuation and capitalization included.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## How to Use this Model:

To train, fine-tune, or experiment with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.

```bash
pip install -U nemo_toolkit["asr"]
```

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="Quantamhash/Quantum_STT_V2.0")
```

### Transcribing using Python

First, let's get a sample:

```bash
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```

Then simply do:

```python
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
```
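`transcribe` also accepts several files in one call. A small batched sketch (file names are placeholders; the `batch_size` argument controls how many files are decoded at once in recent NeMo versions, so treat this as a sketch for your installed version):

```python
# Sketch: batched transcription of multiple files (placeholder file names).
outputs = asr_model.transcribe(['audio1.wav', 'audio2.wav'], batch_size=2)
for out in outputs:
    print(out.text)
```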

### Transcribing with timestamps

To transcribe with timestamps:

```python
output = asr_model.transcribe(['2086-149220-0033.wav'], timestamps=True)

# by default, timestamps are enabled for char, word and segment level
word_timestamps = output[0].timestamp['word']        # word-level timestamps for first sample
segment_timestamps = output[0].timestamp['segment']  # segment-level timestamps
char_timestamps = output[0].timestamp['char']        # char-level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")
```

## Software Integration:

### Runtime Engine(s):

- NeMo 2.2

### Supported Operating System(s):

- Linux

### Hardware Specific Requirements:

At least 2 GB of RAM is required to load the model. More available RAM allows longer audio inputs to be transcribed.

## Model Version

Current version: Quantum_STT_V2.0. Previous versions can be accessed here.

## Performance

### Huggingface Open-ASR-Leaderboard Performance

The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Given that this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio.
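For reference, WER is the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference transcript, divided by the number of reference words. A quick, illustrative way to compute the metric yourself is the jiwer package (the example strings below are made up, not from the evaluation sets):

```python
# Illustrative WER computation with jiwer (pip install jiwer).
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / number of reference words
print(jiwer.wer(reference, hypothesis))
```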

### Base Performance

The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

| Model | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Quantum_STT_V2.0 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |

### Noise Robustness

Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

| SNR Level | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |
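For context, mixing noise into speech at a target SNR generally means scaling the noise so that 10·log10(P_speech / P_noise) equals the desired level. The NumPy sketch below illustrates the idea; it is not the exact evaluation code, and `speech`/`noise` are placeholder arrays of equal length:

```python
# Sketch: mix noise into speech at a target SNR (dB).
# `speech` and `noise` are placeholder float arrays of equal length.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale noise so that 10*log10(p_speech / p_noise_scaled) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```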

### Telephony Audio Performance

Performance comparison between standard 16kHz audio and telephony-style audio (using μ-law encoding with 16kHz→8kHz→16kHz conversion):

| Audio Format | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |

These WER scores were obtained using greedy decoding without an external language model.
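The telephony-style condition above (μ-law companding with a 16kHz→8kHz→16kHz round trip) can be approximated with torchaudio as sketched below. This illustrates the degradation pipeline, not the exact evaluation script; `speech.wav` is a placeholder 16 kHz mono file:

```python
# Sketch: approximate telephony-style audio (mu-law, 16 kHz -> 8 kHz -> 16 kHz).
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("speech.wav")  # expected: 16 kHz mono, sr == 16000
narrowband = F.resample(waveform, orig_freq=sr, new_freq=8000)
# 8-bit mu-law encode/decode round trip (256 quantization channels)
companded = F.mu_law_decoding(F.mu_law_encoding(narrowband, 256), 256)
telephony = F.resample(companded, orig_freq=8000, new_freq=16000)
torchaudio.save("speech_telephony.wav", telephony, 16000)
```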