FAMA-small-asr

Overview
Usage
Results
License
Citation

Overview

FAMA is the first family of large-scale open-science speech foundation models (SFMs) for English and Italian, trained on over 150k hours of exclusively open-source (OS)-compliant speech data.

FAMA models achieve remarkable results, with automatic speech recognition (ASR) and speech translation (ST) improvements on average across languages compared to OWSM, and are competitive in terms of ASR performance with the Whisper model family while being up to 8 times faster.

All the artifacts used to build the FAMA models, including the codebase, the datasets, and the models themselves, are released under OS-compliant licenses, promoting a more responsible creation of models in our community.

FAMA is available in 2 sizes, each with an additional ASR-only variant:

FAMA-small - 475 million parameters
FAMA-medium - 878 million parameters
FAMA-small-asr - 475 million parameters
FAMA-medium-asr - 878 million parameters

For further details, please refer to the paper FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian. The code is available in the GitHub repository.

Usage

FAMA models are supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers and Datasets libraries.

pip install transformers==4.48.1 datasets

To perform a single inference on a sample audio file using the pipeline class, run:

import torch
from transformers import AutoProcessor, pipeline
from datasets import load_dataset

model_id = "FBK-MT/fama-small-asr"
processor = AutoProcessor.from_pretrained(model_id)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
tgt_lang = "en"

# Force the model to start with the language tag
lang_tag = "<lang:{}>".format(tgt_lang)
lang_tag_id = processor.tokenizer.convert_tokens_to_ids(lang_tag)

generate_kwargs = {"num_beams": 5, "no_repeat_ngram_size": 5, "forced_bos_token_id": lang_tag_id}

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    trust_remote_code=True,
    torch_dtype=torch.float32,
    device=device,
    return_timestamps=False,
    generate_kwargs=generate_kwargs
)

dataset = load_dataset("distil-whisper/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Here, tgt_lang is the target language (either en or it). The source language does not need to be specified. To run inference on a local audio file audio.wav, call the pipeline with:

result = pipe("audio.wav")

To perform batch inference with a batch size of batch_size (2 in this example), run:

result = pipe(["audio_1.wav", "audio_2.wav"], batch_size=2)

For inference, we suggest converting the audio files to WAV format with a 16 kHz sampling rate and a single channel.
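The conversion suggested above can be sketched with the Python standard library alone. This is a minimal illustration, not the procedure used by the FAMA authors: the function name to_mono_16k is hypothetical, the input is assumed to be 16-bit PCM WAV, and the resampling is plain linear interpolation without an anti-aliasing filter. In practice, a dedicated tool such as ffmpeg or a resampling library is likely a better choice.

```python
import sys
import wave
from array import array


def to_mono_16k(src_path, dst_path, target_rate=16000):
    """Convert a 16-bit PCM WAV file to mono at target_rate (naive sketch)."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM input")
        channels = src.getnchannels()
        rate = src.getframerate()
        samples = array("h")
        samples.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Downmix: average the interleaved channel samples of each frame.
    n_frames = len(samples) // channels
    mono = [
        sum(samples[i * channels:(i + 1) * channels]) / channels
        for i in range(n_frames)
    ]

    # Resample by linear interpolation between neighbouring input samples.
    n_out = n_frames * target_rate // rate
    out = array("h")
    for k in range(n_out):
        pos = k * rate / target_rate
        lo = int(pos)
        hi = min(lo + 1, n_frames - 1)
        frac = pos - lo
        out.append(int(round(mono[lo] * (1 - frac) + mono[hi] * frac)))
    if sys.byteorder == "big":
        out.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(target_rate)
        dst.writeframes(out.tobytes())
```

The resulting file can then be passed to the pipeline by path, as in the pipe("audio.wav") call above.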

Results

We evaluate FAMA-ASR on ASR using popular open-source datasets such as CommonVoice, Multilingual LibriSpeech (MLS), and VoxPopuli. The metric used is WER (↓).
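For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below assumes WER is reported as a percentage and splits on whitespace only. The function name word_error_rate is illustrative, and the text normalization applied before scoring in the FAMA evaluation is not specified here, so scores from this snippet are not directly comparable with the tables.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance over reference length, as a percentage."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 1))  # 33.3
```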

We also benchmark FAMA in terms of computational time and maximum batch size supported on Hugging Face against Whisper and SeamlessM4T models. The metric used is the inverse real time factor (xRTF).
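The inverse real time factor is the duration of the processed audio divided by the time needed to process it, so higher values mean faster inference. A minimal sketch of the computation follows. The helper names and the timing wrapper are illustrative assumptions and do not reproduce the benchmarking setup behind the reported numbers.

```python
import time


def inverse_real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """xRTF = audio duration / processing time (higher is faster)."""
    if audio_seconds <= 0 or processing_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / processing_seconds


def measure_xrtf(transcribe, audio_seconds: float) -> float:
    """Time a zero-argument callable that transcribes audio_seconds of audio."""
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return inverse_real_time_factor(audio_seconds, elapsed)


# Hypothetical numbers: 120 s of audio transcribed in 10 s of wall-clock time.
print(inverse_real_time_factor(120.0, 10.0))  # 12.0
```

With a pipeline object such as the one in the Usage section, the callable could be something like lambda: pipe("audio.wav"), with audio_seconds set to the length of that file.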

Key highlights:

FAMA achieves an improvement of up to 4.2 WER points on average across languages compared to OWSM v3.1
FAMA is up to 8 times faster than Whisper large-v3 while achieving comparable performance

Automatic Speech Recognition (ASR)

Model/Dataset WER (↓)	CommonVoice-en	CommonVoice-it	MLS-en	MLS-it	VoxPopuli-en	VoxPopuli-it	AVG-en	AVG-it
Whisper medium	14.5	10.4	14.2	15.9	8.1	26.8	12.3	17.7
Whisper large-v3	11.2	6.5	5.0	8.8	7.1	18.8	7.8	11.4
OWSM v3.1 medium	11.9	12.5	6.6	19.3	8.4	24.0	9.0	18.6
SeamlessM4T medium	10.7	7.8	8.8	11.3	10.2	18.2	9.9	12.4
SeamlessM4T v2-large	7.7	5.0	6.4	8.5	6.9	16.6	7.0	10.0
FAMA-ASR small	13.8	8.9	5.8	12.6	7.2	15.7	8.9	12.4
FAMA-ASR medium	11.7	7.1	5.1	12.2	7.0	15.9	7.9	11.7
FAMA small	13.7	8.6	5.8	12.8	7.3	15.6	8.9	12.3
FAMA medium	11.5	7.0	5.2	13.9	7.2	15.9	8.0	12.3

Computational Time and Maximum Batch Size

Model	Batch Size	xRTF en (↑)	xRTF it (↑)	xRTF AVG (↑)
Whisper medium	8	13.3	10.9	12.1
Whisper large-v3	4	7.9	6.5	7.2
SeamlessM4T medium	2	28.5	26.2	27.4
SeamlessM4T v2-large	2	13.7	13.3	13.5
FAMA small	16	57.4	56.0	56.7
FAMA medium	8	39.5	41.2	40.4
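The "up to 8 times faster" figure can be checked against the xRTF AVG column of the table above by dividing each FAMA value by the Whisper large-v3 value. The snippet below only restates the table; the speedup helper is an illustrative name.

```python
# xRTF AVG values copied from the table above.
xrtf_avg = {
    "Whisper medium": 12.1,
    "Whisper large-v3": 7.2,
    "SeamlessM4T medium": 27.4,
    "SeamlessM4T v2-large": 13.5,
    "FAMA small": 56.7,
    "FAMA medium": 40.4,
}


def speedup(model: str, baseline: str = "Whisper large-v3") -> float:
    """Ratio of average xRTF between a model and the baseline."""
    return xrtf_avg[model] / xrtf_avg[baseline]


for name in ("FAMA small", "FAMA medium"):
    print(f"{name}: {speedup(name):.1f}x")
# FAMA small: 7.9x
# FAMA medium: 5.6x
```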

License

We release the FAMA model weights and training data under the CC-BY 4.0 license. The training data can be found in FAMA Training Data. The original FBK-fairseq codebase used to train the model is released under the Apache 2.0 license.

Citation

If you use FAMA in your work, please cite:

@misc{papi2025fama,
      title={FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian}, 
      author={Sara Papi and Marco Gaido and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
      year={2025}
}
