ABR's asr-19m-v2-en-32b SSM
The asr-19m-v2-en-32b model is a State Space Model (SSM) with attention that performs automatic speech recognition (ASR), trained and released by Applied Brain Research (ABR). The model contains about 19 million parameters, transcribes speech in English, and was trained on 15k hours of speech data (competitors use about 200k hours). SSMs are well suited to streaming contexts, but this model is not streaming, which allows a more direct comparison with other models on the leaderboard. Variants of this model for streaming and other languages are available from ABR.
Usage
Install requirements
pip install datasets torch torchcodec transformers sentencepiece
Automatically instantiate the model
import torch
from datasets import load_dataset
from transformers import AutoFeatureExtractor, AutoModel, AutoTokenizer
model_id = "abr-ai/asr-19m-v2-en-32b"
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, trust_remote_code=True
)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
Transcribing using Python
First, we need a sample of English speech data:
dataset = load_dataset("librispeech_asr", "clean", split="test", streaming=True)
samples = list(dataset.take(3)) # Take 3 examples
Then transcribe the first sample:
audio = samples[0]["audio"]["array"]
features = feature_extractor(audio)
logits = model(features)
transcription = tokenizer.decode_from_logits(logits)
print(transcription)
Transcribing many audio files
audio_list = [sample["audio"]["array"] for sample in samples]
batch_features = feature_extractor(audio_list)
batch_outputs = model(batch_features["input_features"], mask=batch_features["mask"])
transcriptions = tokenizer.decode_from_logits(
    batch_outputs["logits"], mask=batch_outputs["mask"]
)
for t in transcriptions:
    print(t)
Input
This model accepts 16 kHz mono-channel audio (WAV files) as input.
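Before passing a file to the model, it can help to confirm that it matches the stated format. The following is a minimal sketch using only Python's standard `wave` module. The function name `check_wav_format` and the returned dictionary layout are illustrative assumptions, not part of ABR's code, and the `wave` module only reads uncompressed PCM WAV files.

```python
import wave

EXPECTED_RATE_HZ = 16000  # sample rate stated in this model card
EXPECTED_CHANNELS = 1     # mono


def check_wav_format(path):
    """Return (ok, details) for a PCM WAV file by reading its header.

    Hypothetical helper for illustration; not part of the model's API.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    ok = rate == EXPECTED_RATE_HZ and channels == EXPECTED_CHANNELS
    details = {
        "sample_rate_hz": rate,
        "channels": channels,
        "duration_s": frames / rate if rate else 0.0,
    }
    return ok, details
```

Calling `check_wav_format("sample.wav")` (with `sample.wav` standing in for any local file) returns `False` in the first element when the audio would need resampling or downmixing before transcription.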
Output
This model provides transcribed speech as a string for a given audio sample.
Model Details
The SSM ASR model is trained for English speech recognition and transcribes audio into text. ABR developed the model to demonstrate that small, efficient, accurate, real-time speech recognition can be performed with SSMs and run on low-cost third-party hardware. ABR also provides a custom low-cost chip on which similar models run at significantly lower power. The model uses about 19 million parameters. The version posted here is a non-causal model (like most on the leaderboard), to allow fair performance comparisons. It is also available as a cascaded model, which produces extremely low-latency causal outputs (<120 ms from first audio to token) as well as non-causal outputs with 1 s latency. These two streams can be merged to give a quick response that is updated with the final result after 1 s.
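The idea of merging a fast causal stream with a slower non-causal stream can be sketched as follows. This is not ABR's implementation: the class `MergedTranscript`, the notion of numbered segments, and the `update` interface are all hypothetical, chosen only to illustrate "show the quick guess, then replace it with the final result".

```python
class MergedTranscript:
    """Keep one hypothesis per audio segment, preferring final results.

    Hypothetical sketch: segment numbering and this interface are
    assumptions, not part of the released model.
    """

    def __init__(self):
        self._segments = {}  # segment index -> (text, is_final)

    def update(self, segment, text, final):
        """Record a hypothesis; a causal guess never overwrites a final one."""
        current = self._segments.get(segment)
        if current is not None and current[1] and not final:
            return
        self._segments[segment] = (text, final)

    def text(self):
        """Join the best available hypothesis for each segment, in order."""
        parts = (self._segments[k][0] for k in sorted(self._segments))
        return " ".join(p for p in parts if p)


merged = MergedTranscript()
merged.update(0, "the cat sad", final=False)  # fast causal guess
merged.update(1, "on the", final=False)       # next causal guess
merged.update(0, "the cat sat", final=True)   # non-causal correction arrives later
print(merged.text())  # the cat sat on the
```

A user interface built this way can display text almost immediately and then quietly correct it once the non-causal output for the same stretch of audio is available.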
Release Date: November 18, 2025
Model Type
Automatic speech recognition model transcribing speech audio to text in English.
Model Use
The intended use of the model is for evaluation by AI developers who want extremely small but performant ASR. We recognize that it is not possible to enforce our intended use guidelines. The models should not be used to transcribe individuals without their explicit consent, or be used to infer any particular human features as only text output is generated by the model. Other capabilities have not been evaluated. We recommend against using the model in high-risk settings (such as making important decisions) where errors in the model output can result in significant consequences for users. We strongly recommend that users perform extensive evaluations for their use cases.
Training
The model was trained on the datasets partially listed below. It applies MFCC preprocessing to the input, is trained with CTC loss, and uses greedy CTC decoding with SentencePiece tokenization.
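Greedy CTC decoding in general works by taking the highest-scoring token in each frame, merging consecutive repeats, and then dropping the blank token. The sketch below shows that generic procedure on a toy character vocabulary. The blank index of 0, the vocabulary, and the function name are assumptions for illustration; the model's own `decode_from_logits` may differ in its details, for example in how SentencePiece pieces are joined into words.

```python
def greedy_ctc_decode(frame_scores, vocab, blank_id=0):
    """Greedy CTC: best token per frame, merge repeats, then drop blanks.

    frame_scores: list of per-frame score lists (one score per vocab entry).
    vocab: list mapping token id -> string (toy vocabulary, an assumption).
    """
    best_ids = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    tokens = []
    previous = None
    for token_id in best_ids:
        # Emit a token only when it differs from the previous frame's
        # token and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            tokens.append(vocab[token_id])
        previous = token_id
    return "".join(tokens)


toy_vocab = ["<blank>", "h", "i"]
toy_scores = [
    [0.1, 0.8, 0.1],    # h
    [0.2, 0.7, 0.1],    # h (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # i
    [0.1, 0.2, 0.7],    # i (repeat, merged)
]
print(greedy_ctc_decode(toy_scores, toy_vocab))  # hi
```

Because the blank separates repeated tokens, a sequence such as `h, <blank>, h` decodes to `hh`, whereas `h, h` decodes to a single `h`.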
Datasets
The datasets include several thousand hours of English speech:
- LibriSpeech (clean)
- VoxPopuli
- GigaSpeech
- Common Voice
- TED-LIUM
- Europarl
- Earnings-22
- AMI-IHM
- SPGISpeech
Performance
Our evaluations show that the SSM ASR model outperforms similarly sized, and often larger, ASR models on benchmark datasets. The posted model transcribes audio to lowercase English with no punctuation. Performance is reported as Word Error Rate (WER, in %) for the non-causal model.
Average WER = 10.61%
| Dataset | WER |
|---|---|
| AMI-IHM | 18.76% |
| Earnings-22 | 13.53% |
| GigaSpeech | 15.44% |
| LibriSpeech (clean) | 4.66% |
| LibriSpeech (other) | 11.16% |
| SPGISpeech | 3.94% |
| TED-LIUM | 7.53% |
| VoxPopuli | 9.88% |
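For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below is a generic implementation, not the leaderboard's evaluation code, and the example sentences are made up. It also shows that the unweighted mean of the eight rows in the table reproduces the reported average.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the reference words seen so
    # far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.2%}")  # 16.67%

# Per-dataset WER values copied from the table above.
per_dataset_wer = {
    "AMI-IHM": 18.76,
    "Earnings-22": 13.53,
    "GigaSpeech": 15.44,
    "LibriSpeech (clean)": 4.66,
    "LibriSpeech (other)": 11.16,
    "SPGISpeech": 3.94,
    "TED-LIUM": 7.53,
    "VoxPopuli": 9.88,
}
average_wer = sum(per_dataset_wer.values()) / len(per_dataset_wer)
print(f"{average_wer:.4f}")  # 10.6125
```

Since the model emits lowercase text without punctuation, references would need the same normalization before scoring; otherwise casing and punctuation differences count as word errors.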
Limitations
Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms, significant noise, or vernacular that the model has not been trained on. The model might also perform worse for accented speech. The model might generate text that was not actually spoken in the input audio.
Broader Implications
We intend for the SSM ASR model to be used for beneficial purposes, including low-cost transcription on low-cost hardware, providing accessibility, and improving real-time voice user interfaces. As with all AI technology, there are also reasons to be concerned about dual use. For instance, lowering the cost may allow broader deployment of undesired surveillance technology or the inexpensive scaling of existing technology. Related safety concerns come from the model being used to identify individuals or being deployable in very small-footprint hardware.
License
This model is made available under ABR's open license.
Citation
@misc{
AppliedBrainResearch2025,
author = {Applied Brain Research, Inc},
title = {asr-19m-v2-en-32b},
year = {2025},
publisher = {HuggingFace},
journal = {HuggingFace repository},
howpublished = {\url{https://huggingface.co/abr-ai/asr-19m-v2-en-32b}},
}