---
license: cc-by-4.0
language:
- en
- it
datasets:
- FBK-MT/mosel
- facebook/covost2
- openslr/librispeech_asr
- facebook/voxpopuli
metrics:
- wer
tags:
- speech
- speech recognition
- ASR
---

# FAMA-small-asr
<div>
<img src="FAMA.png" width="100%" alt="FAMA" />
</div>

## Table of Contents
1. [Overview](#overview)
2. [Usage](#usage)
3. [Results](#results)
4. [License](#license)
5. [Citation](#citation)

## Overview

FAMA is the first family of large-scale open-science SFMs (speech foundation models) for English and
Italian, trained on [over 150k hours of exclusively open-source (OS)-compliant speech data](https://huggingface.co/datasets/FBK-MT/fama-data).

FAMA models achieve [remarkable results](#results), with ASR and ST improvements on average across languages compared to OWSM,
and are competitive in ASR performance with the Whisper model family while being up to 8 times faster.

All the artifacts used to build the FAMA models, including the codebase, datasets, and the models
themselves, are [released under OS-compliant licenses](#license), promoting more
responsible model creation in our community.

FAMA is available in 2 sizes, with 2 ASR-only variants:

- [FAMA-small](https://huggingface.co/FBK-MT/fama-small) - 475 million parameters
- [FAMA-medium](https://huggingface.co/FBK-MT/fama-medium) - 878 million parameters
- [FAMA-small-asr](https://huggingface.co/FBK-MT/fama-small-asr) - 475 million parameters
- [FAMA-medium-asr](https://huggingface.co/FBK-MT/fama-medium-asr) - 878 million parameters

For more information about FAMA, please check our [blog post](https://huggingface.co/blog/FAMA/release) and the [arXiv](https://arxiv.org/) preprint.

## Usage

FAMA models are supported in Hugging Face 🤗 Transformers.
To run the model, first install the Transformers and Datasets libraries:

```sh
pip install transformers==4.48.1 datasets
```

To perform a single inference on a sample audio file using the
[`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class, run:

```python
import torch
from transformers import AutoProcessor, pipeline
from datasets import load_dataset

model_id = "FBK-MT/fama-small-asr"
processor = AutoProcessor.from_pretrained(model_id)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
tgt_lang = "en"

# Force the model to start with the language tag
lang_tag = "<lang:{}>".format(tgt_lang)
lang_tag_id = processor.tokenizer.convert_tokens_to_ids(lang_tag)

generate_kwargs = {"num_beams": 5, "no_repeat_ngram_size": 5, "forced_bos_token_id": lang_tag_id}

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    trust_remote_code=True,
    torch_dtype=torch.float32,
    device=device,
    return_timestamps=False,
    generate_kwargs=generate_kwargs,
)

dataset = load_dataset("distil-whisper/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

Here, `tgt_lang` is the target language (either `en` or `it`); the source language does not need to be specified.
To run inference on a local audio file `audio.wav`, call the pipeline with:

```python
result = pipe("audio.wav")
```

To perform batched inference with batch size `batch_size`, run:

```python
result = pipe(["audio_1.wav", "audio_2.wav"], batch_size=2)
```

For inference, we suggest converting audio files to WAV format with a 16 kHz sampling rate and a single channel, as in the sketch below.
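
A minimal conversion sketch using torchaudio (our choice for illustration; the input file name is hypothetical, and tools such as ffmpeg or sox work equally well):

```python
import torchaudio

# Load the source audio (any format supported by the installed backend)
waveform, sr = torchaudio.load("input.mp3")

# Downmix to a single channel and resample to 16 kHz
waveform = waveform.mean(dim=0, keepdim=True)
waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

torchaudio.save("audio.wav", waveform, 16000)
```
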
## Results

We evaluate FAMA-ASR on popular open-source ASR datasets: CommonVoice, Multilingual LibriSpeech (MLS), and VoxPopuli.
The metric used is WER (↓, lower is better).
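
For reference, WER can be computed with the Hugging Face `evaluate` library, as in this minimal sketch (the reference and hypothesis strings are made-up examples):

```python
import evaluate

# WER = (substitutions + insertions + deletions) / number of reference words
wer_metric = evaluate.load("wer")
references = ["the cat sat on the mat"]
predictions = ["the cat sit on the mat"]  # one substitution out of six words
print(wer_metric.compute(references=references, predictions=predictions))  # ~0.167
```
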
We also benchmark FAMA against the Whisper and SeamlessM4T models in terms of computational time and the maximum batch size supported on Hugging Face. The metric used is the inverse real-time factor (xRTF, ↑), i.e., the seconds of audio processed per second of wall-clock time.
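
As an illustration, xRTF can be measured as in the following sketch, assuming the `pipe` object from the [Usage](#usage) section and a local file `audio.wav`:

```python
import time
import torchaudio

# Duration of the input audio in seconds
info = torchaudio.info("audio.wav")
audio_seconds = info.num_frames / info.sample_rate

# Time one transcription; xRTF = audio duration / processing time
start = time.perf_counter()
result = pipe("audio.wav")
elapsed = time.perf_counter() - start
print(f"xRTF: {audio_seconds / elapsed:.1f}")
```
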
**Key highlights:**
- FAMA achieves up to a 4.2 WER improvement on average across languages compared to OWSM v3.1
- FAMA is up to 8 times faster than Whisper large-v3 while achieving comparable performance

### Automatic Speech Recognition (ASR)

| ***Model/Dataset WER (↓)*** | **CommonVoice**-*en* | **CommonVoice**-*it* | **MLS**-*en* | **MLS**-*it* | **VoxPopuli**-*en* | **VoxPopuli**-*it* | **AVG**-*en* | **AVG**-*it* |
|-----------------------------------------|---------|---------|---------|---------|---------|----------|---------|----------|
| Whisper *medium* | 14.5 | 10.4 | 14.2 | 15.9 | 8.1 | 26.8 | 12.3 | 17.7 |
| Whisper *large-v3* | 11.2 | 6.5 | **5.0** | 8.8 | 7.1 | 18.8 | 7.8 | 11.4 |
| OWSM v3.1 *medium* | 11.9 | 12.5 | 6.6 | 19.3 | 8.4 | 24.0 | 9.0 | 18.6 |
| SeamlessM4T *medium* | 10.7 | 7.8 | 8.8 | 11.3 | 10.2 | 18.2 | 9.9 | 12.4 |
| SeamlessM4T *v2-large* | **7.7** | **5.0** | 6.4 | **8.5** | **6.9** | 16.6 | **7.0** | **10.0** |
| FAMA-ASR *small* | 13.8 | 8.9 | 5.8 | 12.6 | 7.2 | 15.7 | 8.9 | 12.4 |
| FAMA-ASR *medium* | 11.7 | 7.1 | 5.1 | 12.2 | 7.0 | 15.9 | 7.9 | 11.7 |
| FAMA *small* | 13.7 | 8.6 | 5.8 | 12.8 | 7.3 | **15.6** | 8.9 | 12.3 |
| FAMA *medium* | 11.5 | 7.0 | 5.2 | 13.9 | 7.2 | 15.9 | 8.0 | 12.3 |

### Computational Time and Maximum Batch Size

| ***Model*** | ***Batch Size*** | ***xRTF en (↑)*** | ***xRTF it (↑)*** | ***xRTF AVG (↑)*** |
|------------------------|------------|-------------|-------------|--------------|
| Whisper *medium* | 8 | 13.3 | 10.9 | 12.1 |
| Whisper *large-v3* | 4 | 7.9 | 6.5 | 7.2 |
| SeamlessM4T *medium* | 2 | 28.5 | 26.2 | 27.4 |
| SeamlessM4T *v2-large* | 2 | 13.7 | 13.3 | 13.5 |
| FAMA *small* | 16 | **57.4** | **56.0** | **56.7** |
| FAMA *medium* | 8 | 39.5 | 41.2 | 40.4 |

## License

We release the FAMA model weights and training data under the CC-BY 4.0 license.
The training data can be found in [FAMA Training Data](https://huggingface.co/datasets/FBK-MT/fama-data).
The [original FBK-fairseq codebase](https://github.com/hlt-mt/FBK-fairseq) used to train the model is released under the Apache 2.0 license.

## Citation

If you use FAMA in your work, please cite:

```
@misc{papi2025fama,
  title={FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian},
  author={Sara Papi and Marco Gaido and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
  year={2025}
}
```