---
datasets:
- FBK-MT/mosel
- facebook/covost2
- openslr/librispeech_asr
- facebook/voxpopuli
language:
- en
- it
license: cc-by-4.0
metrics:
- wer
tags:
- speech
- speech recognition
- ASR
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# FAMA-small-asr
<div>
  <img src="FAMA.png" width="100%"  alt="FAMA" />
</div>

## Table of Contents
1. [Overview](#overview)
2. [Usage](#usage)
3. [Results](#results)
4. [License](#license)
5. [Citation](#citation)

## Overview

FAMA is the first family of large-scale open-science speech foundation models (SFMs) for English and
Italian, trained on [over 150k hours of exclusively open-source (OS)-compliant speech data](https://huggingface.co/datasets/FBK-MT/fama-data).

FAMA models achieve [remarkable results](#results), with ASR and ST improvements on average across languages compared to OWSM,
and are competitive in ASR performance with the Whisper model family while being up to 8 times faster.

All the artifacts used to build the FAMA models, including the codebase, datasets, and the models
themselves, are [released under OS-compliant licenses](#license), promoting a more
responsible creation of models in our community.

FAMA is available in two sizes, each with an additional ASR-only variant:

- [FAMA-small](https://huggingface.co/FBK-MT/fama-small) - 475 million parameters
- [FAMA-medium](https://huggingface.co/FBK-MT/fama-medium) - 878 million parameters
- [FAMA-small-asr](https://huggingface.co/FBK-MT/fama-small-asr) - 475 million parameters
- [FAMA-medium-asr](https://huggingface.co/FBK-MT/fama-medium-asr) - 878 million parameters

For further details, please refer to the paper [FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian](https://huggingface.co/papers/2505.22759).
The code is available in the [Github repository](https://github.com/hlt-mt/FBK-fairseq).

## Usage

FAMA models are supported in Hugging Face 🤗 Transformers.
To run the model, first install the Transformers and Datasets libraries.

```sh
pip install transformers==4.48.1 datasets
```

To perform a single inference on a sample audio file using the 
[`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) 
class, run:

```python
import torch
from transformers import AutoProcessor, pipeline
from datasets import load_dataset

model_id = "FBK-MT/fama-small-asr"
processor = AutoProcessor.from_pretrained(model_id)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
tgt_lang = "en"

# Force the model to start with the language tag
lang_tag = "<lang:{}>".format(tgt_lang)
lang_tag_id = processor.tokenizer.convert_tokens_to_ids(lang_tag)

generate_kwargs = {"num_beams": 5, "no_repeat_ngram_size": 5, "forced_bos_token_id": lang_tag_id}

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    trust_remote_code=True,
    torch_dtype=torch.float32,
    device=device,
    return_timestamps=False,
    generate_kwargs=generate_kwargs
)

dataset = load_dataset("distil-whisper/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

Here, `tgt_lang` is the target language (either `en` or `it`); the source language does not need to be specified.
To run inference on a local audio file `audio.wav`, call the pipeline with:

```python
result = pipe("audio.wav")
```

To perform a batch inference with size `batch_size`, run:

```python
result = pipe(["audio_1.wav", "audio_2.wav"], batch_size=2)
```

For inference, we suggest converting audio files to WAV format with a 16 kHz sampling rate and a single channel.
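
For example, one common way to perform this conversion is with `ffmpeg` (the input file name `input.mp3` is just a placeholder):

```sh
# Resample to 16 kHz (-ar) and downmix to mono (-ac 1)
ffmpeg -i input.mp3 -ar 16000 -ac 1 audio.wav
```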

## Results

We evaluate FAMA-ASR on ASR using popular open-source datasets such as CommonVoice, Multilingual LibriSpeech (MLS), and VoxPopuli.
The metric used is WER (↓).

We also benchmark FAMA against Whisper and SeamlessM4T models in terms of computational time and maximum batch size supported on Hugging Face. The metric used is the inverse real-time factor (xRTF).
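
For reference, the inverse real-time factor is commonly computed as seconds of audio processed per second of wall-clock time (higher is faster). A minimal, illustrative sketch (the function name and arguments are our own, not part of the released benchmark code):

```python
import time

def inverse_rtf(pipe, audio_paths, total_audio_seconds, batch_size=1):
    """Illustrative xRTF: seconds of audio transcribed per second of wall-clock time."""
    start = time.perf_counter()
    pipe(audio_paths, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    return total_audio_seconds / elapsed
```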

**Key highlights:**
- FAMA achieves up to 4.2 WER improvement on average across languages compared to OWSM v3.1
- FAMA is up to 8 times faster than Whisper large-v3 while achieving comparable performance

### Automatic Speech Recognition (ASR)
| ***Model/Dataset WER (↓)***             | **CommonVoice**-*en* | **CommonVoice**-*it* | **MLS**-*en* | **MLS**-*it* | **VoxPopuli**-*en* | **VoxPopuli**-*it* | **AVG**-*en* | **AVG**-*it* |
|-----------------------------------------|---------|---------|---------|---------|---------|----------|---------|----------|
| Whisper *medium*                        | 14.5    | 10.4    | 14.2    | 15.9    | 8.1     | 26.8     | 12.3    | 17.7     |
| Whisper *large-v3*                      | 11.2    | 6.5     | **5.0** | 8.8     | 7.1     | 18.8     | 7.8     | 11.4     | 
| OWSM v3.1 *medium*                      | 11.9    | 12.5    | 6.6     | 19.3    | 8.4     | 24.0     | 9.0     | 18.6     |
| SeamlessM4T *medium*                    | 10.7    | 7.8     | 8.8     | 11.3    | 10.2    | 18.2     | 9.9     | 12.4     | 
| SeamlessM4T *v2-large*                  | **7.7** | **5.0** | 6.4     | **8.5** | **6.9** | 16.6     | **7.0** | **10.0** | 
| FAMA-ASR *small*                        | 13.8    | 8.9     | 5.8     | 12.6    | 7.2     | 15.7     | 8.9     | 12.4     |
| FAMA-ASR *medium*                       | 11.7    | 7.1     | 5.1     | 12.2    | 7.0     | 15.9     | 7.9     | 11.7     |
| FAMA *small*                            | 13.7    | 8.6     | 5.8     | 12.8    | 7.3     | **15.6** | 8.9     | 12.3     | 
| FAMA *medium*                           | 11.5    | 7.0     | 5.2     | 13.9    | 7.2     | 15.9     | 8.0     | 12.3     |

### Computational Time and Maximum Batch Size

| ***Model***            | ***Batch Size*** | ***xRTF en (↑)*** | ***xRTF it (↑)*** | ***xRTF AVG (↑)*** |
|------------------------|------------|-------------|-------------|--------------|
| Whisper *medium*       | 8          | 13.3        | 10.9        | 12.1         |
| Whisper *large-v3*     | 4          | 7.9         | 6.5         | 7.2          |
| SeamlessM4T *medium*   | 2          | 28.5        | 26.2        | 27.4         |
| SeamlessM4T *v2-large* | 2          | 13.7        | 13.3        | 13.5         |
| FAMA *small*           | 16         | **57.4**    | **56.0**    | **56.7**     |
| FAMA *medium*          | 8          | 39.5        | 41.2        | 40.4         |

## License

We release the FAMA model weights and training data under the CC-BY 4.0 license.
The training data can be found in [FAMA Training Data](https://huggingface.co/datasets/FBK-MT/fama-data).
The [original FBK-fairseq codebase](https://github.com/hlt-mt/FBK-fairseq) used to train the model is released under the Apache 2.0 license.

## Citation

If you use FAMA in your work, please cite:

```
@misc{papi2025fama,
      title={FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian}, 
      author={Sara Papi and Marco Gaido and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
      year={2025}
}
```