|
--- |
|
datasets: |
|
- FBK-MT/mosel |
|
- facebook/covost2 |
|
- openslr/librispeech_asr |
|
- facebook/voxpopuli |
|
language: |
|
- en |
|
- it |
|
license: cc-by-4.0 |
|
metrics: |
|
- wer |
|
tags: |
|
- speech |
|
- speech recognition |
|
- ASR |
|
pipeline_tag: automatic-speech-recognition |
|
library_name: transformers |
|
--- |
|
|
|
# FAMA-small-asr |
|
<div> |
|
<img src="FAMA.png" width="100%" alt="FAMA" /> |
|
</div> |
|
|
|
## Table of Contents |
|
1. [Overview](#overview) |
|
2. [Usage](#usage)

3. [Results](#results)
|
4. [License](#license) |
|
5. [Citation](#citation) |
|
|
|
## Overview |
|
|
|
FAMA is the first family of large-scale open-science speech foundation models (SFMs) for English and
Italian, trained on [over 150k hours of exclusively open-source (OS) compliant speech data](https://huggingface.co/datasets/FBK-MT/fama-data).
|
|
|
FAMA models achieve [remarkable results](#results), improving over OWSM on average across languages in both ASR and ST,
and are competitive in ASR performance with the Whisper model family while being up to 8 times faster.
|
|
|
All the artifacts used to build the FAMA models, including the codebase, datasets, and the models
themselves, are [released under OS-compliant licenses](#license), promoting a more
responsible creation of models in our community.
|
|
|
FAMA is available in two sizes, each with an ASR-only variant:
|
|
|
- [FAMA-small](https://huggingface.co/FBK-MT/fama-small) - 475 million parameters |
|
- [FAMA-medium](https://huggingface.co/FBK-MT/fama-medium) - 878 million parameters |
|
- [FAMA-small-asr](https://huggingface.co/FBK-MT/fama-small-asr) - 475 million parameters |
|
- [FAMA-medium-asr](https://huggingface.co/FBK-MT/fama-medium-asr) - 878 million parameters |
|
|
|
For further details, please refer to the paper [FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian](https://huggingface.co/papers/2505.22759). |
|
The code is available in the [GitHub repository](https://github.com/hlt-mt/FBK-fairseq).
|
|
|
## Usage |
|
|
|
FAMA models are supported in Hugging Face 🤗 Transformers.
|
To run the model, first install the Transformers and Datasets libraries. |
|
|
|
```sh |
|
pip install transformers==4.48.1 datasets |
|
``` |
|
|
|
To perform a single inference on a sample audio file using the |
|
[`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) |
|
class, run: |
|
|
|
```python |
|
import torch |
|
from transformers import AutoProcessor, pipeline |
|
from datasets import load_dataset |
|
|
|
model_id = "FBK-MT/fama-small-asr" |
|
processor = AutoProcessor.from_pretrained(model_id) |
|
|
|
device = "cuda:0" if torch.cuda.is_available() else "cpu" |
|
tgt_lang = "en" |
|
|
|
# Force the model to start with the language tag |
|
lang_tag = "<lang:{}>".format(tgt_lang) |
|
lang_tag_id = processor.tokenizer.convert_tokens_to_ids(lang_tag) |
|
|
|
# Decode with beam search, block repeated 5-grams, and force the language tag as the first generated token
generate_kwargs = {"num_beams": 5, "no_repeat_ngram_size": 5, "forced_bos_token_id": lang_tag_id}
|
|
|
pipe = pipeline( |
|
"automatic-speech-recognition", |
|
model=model_id, |
|
trust_remote_code=True, |
|
torch_dtype=torch.float32, |
|
device=device, |
|
return_timestamps=False, |
|
generate_kwargs=generate_kwargs |
|
) |
|
|
|
# Load a sample audio from the dummy LibriSpeech validation split
dataset = load_dataset("distil-whisper/librispeech_asr_dummy", "clean", split="validation")
|
sample = dataset[0]["audio"] |
|
|
|
result = pipe(sample) |
|
print(result["text"]) |
|
``` |
|
|
|
Here, `tgt_lang` is the target language (either `en` or `it`); the source language does not need to be specified.
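
For example, to transcribe Italian audio, set `tgt_lang = "it"` and rebuild the forced language tag before creating the pipeline (a minimal variation of the example above, reusing the same `processor`):

```python
# Switch the forced language tag to Italian
tgt_lang = "it"
lang_tag_id = processor.tokenizer.convert_tokens_to_ids("<lang:{}>".format(tgt_lang))
generate_kwargs = {"num_beams": 5, "no_repeat_ngram_size": 5, "forced_bos_token_id": lang_tag_id}
```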
|
To run the inference on a local audio file `audio.wav`, call the pipeline with: |
|
|
|
```python |
|
result = pipe("audio.wav") |
|
``` |
|
|
|
To perform batched inference with a batch size of `batch_size`, run:
|
|
|
```python |
|
result = pipe(["audio_1.wav", "audio_2.wav"], batch_size=2) |
|
``` |
|
|
|
For inference, we suggest converting audio files to WAV format with a 16 kHz sampling rate and a single channel, as in the sketch below.
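
A minimal conversion sketch using `torchaudio` (an assumption here, chosen only because `torch` is already installed; any resampling tool, such as ffmpeg, works equally well):

```python
import torchaudio

# Decode the input file (requires a torchaudio backend that supports the format)
waveform, sample_rate = torchaudio.load("input.mp3")

# Downmix to a single channel and resample to 16 kHz
waveform = waveform.mean(dim=0, keepdim=True)
waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)

torchaudio.save("audio.wav", waveform, 16000)
```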
|
|
|
## Results |
|
|
|
We evaluate FAMA-ASR on popular open-source ASR datasets: CommonVoice, Multilingual LibriSpeech (MLS), and VoxPopuli.

The metric used is WER (↓); lower is better.
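
For reference, WER counts substitutions (S), deletions (D), and insertions (I) against the number of reference words (N): WER = (S + D + I) / N. A minimal sketch with the `jiwer` library, shown only as an illustration and not necessarily the toolkit used in the paper:

```python
from jiwer import wer  # pip install jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"

# One deletion over six reference words: WER = 1/6 ≈ 0.167
print(wer(reference, hypothesis))
```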
|
|
|
We also benchmark FAMA against the Whisper and SeamlessM4T models in terms of computational time and the maximum batch size supported on Hugging Face. The metric used is the inverse real time factor (xRTF, ↑): the seconds of audio processed per second of wall-clock time, so higher is faster.
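
A rough way to measure xRTF with the pipeline above (illustrative only; the benchmark setup in the paper may differ):

```python
import time

audio_seconds = 60.0  # assumed duration of audio.wav
start = time.perf_counter()
pipe("audio.wav")
elapsed = time.perf_counter() - start

# e.g., 60 s of audio processed in 2 s of compute gives xRTF = 30
print(f"xRTF = {audio_seconds / elapsed:.1f}")
```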
|
|
|
**Key highlights:** |
|
- FAMA achieves up to a 4.2 WER point improvement on average across languages compared to OWSM v3.1
|
- FAMA is up to 8 times faster than Whisper large-v3 while achieving comparable performance |
|
|
|
### Automatic Speech Recognition (ASR)
|
| ***Model/Dataset WER (↓)*** | **CommonVoice**-*en* | **CommonVoice**-*it* | **MLS**-*en* | **MLS**-*it* | **VoxPopuli**-*en* | **VoxPopuli**-*it* | **AVG**-*en* | **AVG**-*it* |
|
|-----------------------------------------|---------|---------|---------|---------|---------|----------|---------|----------| |
|
| Whisper *medium* | 14.5 | 10.4 | 14.2 | 15.9 | 8.1 | 26.8 | 12.3 | 17.7 | |
|
| Whisper *large-v3* | 11.2 | 6.5 | **5.0** | 8.8 | 7.1 | 18.8 | 7.8 | 11.4 | |
|
| OWSM v3.1 *medium* | 11.9 | 12.5 | 6.6 | 19.3 | 8.4 | 24.0 | 9.0 | 18.6 | |
|
| SeamlessM4T *medium* | 10.7 | 7.8 | 8.8 | 11.3 | 10.2 | 18.2 | 9.9 | 12.4 | |
|
| SeamlessM4T *v2-large* | **7.7** | **5.0** | 6.4 | **8.5** | **6.9** | 16.6 | **7.0** | **10.0** | |
|
| FAMA-ASR *small* | 13.8 | 8.9 | 5.8 | 12.6 | 7.2 | 15.7 | 8.9 | 12.4 | |
|
| FAMA-ASR *medium* | 11.7 | 7.1 | 5.1 | 12.2 | 7.0 | 15.9 | 7.9 | 11.7 | |
|
| FAMA *small* | 13.7 | 8.6 | 5.8 | 12.8 | 7.3 | **15.6** | 8.9 | 12.3 | |
|
| FAMA *medium* | 11.5 | 7.0 | 5.2 | 13.9 | 7.2 | 15.9 | 8.0 | 12.3 | |
|
|
|
### Computational Time and Maximum Batch Size |
|
|
|
| ***Model*** | ***Batch Size*** | ***xRTF en (↑)*** | ***xRTF it (↑)*** | ***xRTF AVG (↑)*** |
|
|------------------------|------------|-------------|-------------|--------------| |
|
| Whisper *medium* | 8 | 13.3 | 10.9 | 12.1 | |
|
| Whisper *large-v3* | 4 | 7.9 | 6.5 | 7.2 | |
|
| SeamlessM4T *medium* | 2 | 28.5 | 26.2 | 27.4 | |
|
| SeamlessM4T *v2-large* | 2 | 13.7 | 13.3 | 13.5 | |
|
| FAMA *small* | 16 | **57.4** | **56.0** | **56.7** | |
|
| FAMA *medium* | 8 | 39.5 | 41.2 | 40.4 | |
|
|
|
## License |
|
|
|
We release the FAMA model weights and training data under the CC-BY 4.0 license.
|
The training data can be found in [FAMA Training Data](https://huggingface.co/datasets/FBK-MT/fama-data). |
|
The [original FBK-fairseq codebase](https://github.com/hlt-mt/FBK-fairseq) used to train the model is released under the Apache 2.0 license. |
|
|
|
## Citation |
|
|
|
If you use FAMA in your work, please cite: |
|
|
|
```bibtex
@misc{papi2025fama,
      title={FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian},
      author={Sara Papi and Marco Gaido and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
      year={2025},
      eprint={2505.22759},
      archivePrefix={arXiv}
}
```