ABR's asr-19m-v2-en-32b SSM

The asr-19m-v2-en-32b model is a State Space Model (SSM) with attention that performs automatic speech recognition (ASR), trained and released by Applied Brain Research (ABR). The model has ~19M parameters, transcribes English speech, and was trained on 15k hours of speech data (competitors use about 200k hours). SSMs are an ideal solution for streaming contexts, but to allow a more direct comparison with other models on the Open ASR Leaderboard, this model is not streaming. Variants of this model for streaming and for other languages are available from ABR.

Usage

Install requirements

pip install datasets torch torchcodec transformers sentencepiece

Automatically instantiate the model

import torch
from datasets import load_dataset
from transformers import AutoFeatureExtractor, AutoModel, AutoTokenizer

model_id = "abr-ai/asr-19m-v2-en-32b"
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, trust_remote_code=True
)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

Transcribing using Python

First, load a few samples of English speech data:

dataset = load_dataset("librispeech_asr", "clean", split="test", streaming=True)
samples = list(dataset.take(3))  # Take 3 examples

Then transcribe the first sample:

audio = samples[0]["audio"]["array"]
features = feature_extractor(audio)
logits = model(features)
transcription = tokenizer.decode_from_logits(logits)
print(transcription)

Transcribing many audio files

audio_list = [sample["audio"]["array"] for sample in samples]
batch_features = feature_extractor(audio_list)
batch_outputs = model(batch_features["input_features"], mask=batch_features["mask"])
transcriptions = tokenizer.decode_from_logits(
    batch_outputs["logits"], mask=batch_outputs["mask"]
)
for t in transcriptions:
    print(t)

Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
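The input requirement above can be checked before a file is passed to the feature extractor. The following is a minimal sketch using only Python's standard `wave` module, which reads uncompressed PCM WAV files; the function name `check_wav_format` and the constants are illustrative and not part of the ABR code.

```python
import wave

EXPECTED_RATE_HZ = 16000  # sample rate stated in the Input section
EXPECTED_CHANNELS = 1     # mono


def check_wav_format(path):
    """Return (ok, sample_rate_hz, channels) for the PCM WAV file at `path`.

    `ok` is True only when the file is 16 kHz and single-channel.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    ok = rate == EXPECTED_RATE_HZ and channels == EXPECTED_CHANNELS
    return ok, rate, channels
```

A file that fails the check would need to be resampled or downmixed with an audio library of your choice before transcription.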

Output

This model provides transcribed speech as a string for a given audio sample.

Model Details

The SSM ASR model is trained for English speech recognition and transcribes audio into text. ABR developed the model to demonstrate that small, efficient, accurate, real-time speech recognition can be performed with SSMs and run on low-cost third-party hardware. ABR also provides a custom low-cost chip on which similar models run at significantly lower power. The model has 19M parameters. The version posted here is a non-causal model (like most on the leaderboard), to give fair performance comparisons. It is also available as a cascaded model, which produces extremely low-latency causal outputs (<120 ms from first audio to token) as well as non-causal outputs at 1 s latency. These two streams can be merged to give a quick response that is updated after 1 s with the final result.

Release Date: November 18, 2025

Model Type

Automatic speech recognition model transcribing speech audio to text in English.

Model Use

The intended use of the model is for evaluation by AI developers who want extremely small but performant ASR. We recognize that it is not possible to enforce our intended use guidelines. The model should not be used to transcribe individuals without their explicit consent, or to infer any particular human features, as the model generates only text output. Other capabilities have not been evaluated. We recommend against using the model in high-risk settings (such as making important decisions) where errors in the model output can result in significant consequences for users. We strongly recommend that users perform extensive evaluations for their use cases.

Training

The model was trained on datasets that include those listed below. It applies MFCC preprocessing to the input and is trained with CTC loss. It uses greedy CTC decoding and SentencePiece tokenization.
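For readers unfamiliar with greedy CTC decoding, the general technique can be sketched in a few lines: pick the highest-scoring token in each frame, collapse consecutive repeats, then drop the blank token. This illustrates the standard algorithm only and is not ABR's implementation; the function name, the blank id of 0, and the three-token vocabulary in the example are assumptions.

```python
def greedy_ctc_decode(frame_scores, blank_id=0):
    """Greedy CTC decoding over per-frame scores.

    frame_scores: list of frames, each a list of scores (one per token id).
    blank_id: id of the CTC blank token (assumed to be 0 here).
    Returns the decoded list of token ids.
    """
    # Step 1: take the best-scoring token id in every frame.
    best_per_frame = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]

    # Steps 2 and 3: collapse consecutive repeats, then drop blanks.
    decoded = []
    previous = None
    for token in best_per_frame:
        if token != previous and token != blank_id:
            decoded.append(token)
        previous = token
    return decoded


# Hypothetical vocabulary: 0 = blank, 1 = "h", 2 = "i".
frames = [
    [0.1, 0.8, 0.1],    # h
    [0.2, 0.7, 0.1],    # h (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # i
    [0.1, 0.2, 0.7],    # i (repeat, collapsed)
]
print(greedy_ctc_decode(frames))  # [1, 2]
```

Turning the resulting ids back into text is the tokenizer's job; in the usage examples above, `tokenizer.decode_from_logits` handles the whole decoding path.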

Datasets

The datasets include several thousand hours of English speech:

LibriSpeech (clean)
VoxPopuli
GigaSpeech
Common Voice
TED-LIUM
Europarl
Earnings-22
AMI-IHM
SPGISpeech

Performance

Our evaluations show that the SSM ASR model performs better on benchmark datasets than other similarly sized, and often larger, ASR models. The posted model transcribes audio to lowercase English with no punctuation. Performance is reported as Word Error Rate (WER, in %) for the non-causal model.
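As a reminder of what the metric measures, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference and a hypothesis, divided by the number of reference words. The sketch below is a simplified illustration, not the evaluation code behind the reported numbers. Because the model emits lowercase text without punctuation, the hypothetical `normalize` helper applies the same treatment to both strings; benchmark harnesses typically use their own, more elaborate normalization.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference, hypothesis):
    """Return WER in percent between two transcripts."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping one row of the table at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return 100.0 * previous[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(round(word_error_rate("The cat sat on the mat.", "the cat sit on mat"), 2))  # 33.33
```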

Average WER = 10.61%

Dataset	WER
AMI-IHM	18.76%
Earnings-22	13.53%
GigaSpeech	15.44%
LibriSpeech (clean)	4.66%
LibriSpeech (other)	11.16%
SPGISpeech	3.94%
TED-LIUM	7.53%
VoxPopuli	9.88%
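The reported average is consistent with an unweighted mean over the eight test sets in the table, as this short check shows. The dictionary simply copies the values above; the variable names are illustrative.

```python
# Per-dataset WER (%) copied from the table above.
reported_wer = {
    "AMI-IHM": 18.76,
    "Earnings-22": 13.53,
    "GigaSpeech": 15.44,
    "LibriSpeech (clean)": 4.66,
    "LibriSpeech (other)": 11.16,
    "SPGISpeech": 3.94,
    "TED-LIUM": 7.53,
    "VoxPopuli": 9.88,
}

# Unweighted (macro) average: every dataset counts equally,
# regardless of how many utterances it contains.
average_wer = sum(reported_wer.values()) / len(reported_wer)
print(f"{average_wer:.2f}")  # 10.61
```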

Limitations

Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, significant noise, or vernacular that the model has not been trained on. The model might also perform worse for accented speech. The model might generate text that was not actually spoken in the input audio.

Broader Implications

We intend for the SSM ASR model to be used for beneficial purposes, including low cost transcription on low cost hardware, providing accessibility, and improving real-time voice user interfaces. As with all AI technology, there are also reasons to be concerned about dual-use. For instance, lowering the cost may allow broader deployment of undesired surveillance technology or the inexpensive scaling of existing technology. Related safety concerns come from the model being used to identify individuals or being deployable in very small footprint hardware.

License

This model is made available under ABR's open license.

Citation

@misc{
  AppliedBrainResearch2025, 
  author = {Applied Brain Research, Inc}, 
  title = {asr-19m-v2-en-32b}, 
  year = {2025}, 
  publisher = {HuggingFace}, 
  journal = {HuggingFace repository}, 
  howpublished = {\url{https://huggingface.co/abr-ai/asr-19m-v2-en-32b}},
}
