Model Details
Model Description
A 17.31M parameter multilingual linear projector trained for automatic speech recognition (ASR) using the SLAM-ASR speechLLM framework. Within this framework, only the linear projector was trained alongside a frozen speech encoder (Whisper-large-v3-turbo) and frozen LLM (EuroLLM-1.7B).
- Developed by: SpeechTek Unit at Fondazione Bruno Kessler
- Funded by: This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
- Model type: Linear projector in a speechLLM framework
- Supported Language(s): English, Italian, Spanish, German, French
- License: CC-BY-4.0
Uses
This model is trained for Automatic Speech Recognition (ASR).
How to Get Started with the Model
This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding with the shell scripts provided in the SLAM-ASR codebase; please refer to the instructions there for further details.
Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.
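As a hedged illustration (not part of the official SLAM-ASR scripts), the two frozen base models can be pre-downloaded from the Hugging Face Hub and their local paths passed to the SLAM-ASR shell scripts. The repository ids below are assumptions inferred from the model names in this card.

```python
# Hedged sketch: pre-download the frozen base models from the Hugging Face Hub.
# The repo ids are assumptions inferred from the model names in this card.
from huggingface_hub import snapshot_download

encoder_path = snapshot_download("openai/whisper-large-v3-turbo")
llm_path = snapshot_download("utter-project/EuroLLM-1.7B")
print(encoder_path, llm_path)  # local paths to point the SLAM-ASR scripts at
```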
Training Details
Training Data
The linear projector was trained with a total of 500 hours of data from Common Voice 20.0 and Fleurs, covering 5 languages (English, Italian, Spanish, German, and French). Specifically, the training set consisted of 92.5 hours of Common Voice data + 7.5 hours of Fleurs data per language, while the validation set consisted of 47 minutes of Common Voice data + 47 minutes of Fleurs data per language.
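For clarity, a quick sanity check of the reported data mix (all values taken from the paragraph above):

```python
# Sanity check of the training/validation data composition described above.
languages = ["English", "Italian", "Spanish", "German", "French"]
cv_train_h, fleurs_train_h = 92.5, 7.5        # training hours per language
cv_val_min = fleurs_val_min = 47              # validation minutes per language

train_hours = len(languages) * (cv_train_h + fleurs_train_h)
val_hours = len(languages) * (cv_val_min + fleurs_val_min) / 60

print(train_hours)          # 500.0 hours of training data
print(round(val_hours, 2))  # ~7.83 hours of validation data
```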
Training Procedure
- The model was trained using the codebase provided by the official SLAM-ASR GitHub repository with `torchrun`.
- Only the linear projector was trained.
- The speech encoder (Whisper-large-v3-turbo) and the LLM (EuroLLM-1.7B) were kept frozen (see the sketch after this list).
- No prompt was used during training or inference.
- Training was conducted with one NVIDIA Ada Lovelace L40S GPU.
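A minimal sketch of the freezing setup described in the list above (illustrative only; the actual training is driven by the SLAM-ASR shell scripts, and the model ids are assumptions inferred from this card):

```python
# Illustrative sketch of the setup above: encoder and LLM frozen,
# only the projector's parameters are passed to the optimizer.
import torch
from transformers import AutoModelForCausalLM, WhisperModel

encoder = WhisperModel.from_pretrained("openai/whisper-large-v3-turbo").get_encoder()
llm = AutoModelForCausalLM.from_pretrained("utter-project/EuroLLM-1.7B")

for frozen in (encoder, llm):
    frozen.requires_grad_(False)
    frozen.eval()

# Placeholder single linear layer; see the projector sketch further below.
projector = torch.nn.Linear(1280 * 5, 2048)
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in projector.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # only the projector's weights
```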
Training Hyperparameters
Hyperparameter | Value |
---|---|
llm_name | eurollm-1.7b |
llm_dim | 2048 |
context_length | 4096 |
encoder_name | whisper |
encoder_projector_ds_rate | 5 |
encoder_dim | 1280 |
encoder_projector | linear |
input_type | mel |
mel_size | 128 |
epochs | 6 |
freeze_encoder | true |
freeze_llm | true |
warmup_steps | 1000 |
total_steps | 100000 |
lr | 1e-4 |
validation_interval | 1000 |
batch_size_training | 4 |
val_size_training | 4 |
num_workers_dataloader | 2 |
optimizer | AdamW |
enable_fsdp | false |
enable_ddp | true |
use_fp16 | true |
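As a point of reference, the sketch below shows a projector shape consistent with the hyperparameters above and with the 17.31M parameter figure in the model description: frames are downsampled by concatenating groups of 5 (encoder_projector_ds_rate), then mapped into the LLM embedding space through two linear layers. The 2048-dimensional hidden layer is an assumption inferred from the parameter count, not a value stated in this card.

```python
# Hedged sketch of a projector consistent with the table above and the 17.31M
# figure: concatenate every 5 encoder frames, then Linear -> ReLU -> Linear.
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    def __init__(self, encoder_dim=1280, llm_dim=2048, ds_rate=5, hidden_dim=2048):
        super().__init__()
        self.ds_rate = ds_rate
        self.linear1 = nn.Linear(encoder_dim * ds_rate, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, llm_dim)

    def forward(self, x):  # x: (batch, frames, encoder_dim)
        b, t, d = x.shape
        t = t - t % self.ds_rate  # drop trailing frames that don't fill a group
        x = x[:, :t, :].reshape(b, t // self.ds_rate, d * self.ds_rate)
        return self.linear2(self.relu(self.linear1(x)))

proj = LinearProjector()
print(sum(p.numel() for p in proj.parameters()))  # 17,305,600 ≈ 17.31M
```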
Evaluation
The model was evaluated using the Word Error Rate (WER) metric from the `evaluate` library.
Prior to computing the WER, the ground-truth and predicted transcripts were preprocessed with Whisper's `EnglishTextNormalizer` for English and `BasicTextNormalizer` for all other languages.
Beam search decoding was used with a beam size of 4.
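A minimal sketch of this scoring recipe (not the exact evaluation script): the empty spelling mapping passed to `EnglishTextNormalizer` and the example strings are simplifications for illustration.

```python
# Hedged sketch of the scoring setup described above: WER from the `evaluate`
# library after Whisper-style text normalization.
import evaluate
from transformers.models.whisper.english_normalizer import (
    BasicTextNormalizer,
    EnglishTextNormalizer,
)

wer_metric = evaluate.load("wer")

# An empty spelling mapping is used here for brevity; the full mapping ships
# with the Whisper tokenizer files.
english_norm = EnglishTextNormalizer({})
basic_norm = BasicTextNormalizer()

def score(references, predictions, language):
    norm = english_norm if language == "en" else basic_norm
    return 100 * wer_metric.compute(
        references=[norm(r) for r in references],
        predictions=[norm(p) for p in predictions],
    )

print(score(["hello world"], ["hello word"], language="en"))  # 50.0
```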
Results
Dataset | Language | WER (%) ↓ |
---|---|---|
Common Voice 20.0 | English | 13.5 |
Fleurs | English | 5.5 |
Common Voice 20.0 | Italian | 6.4 |
Fleurs | Italian | 5.8 |
Common Voice 20.0 | Spanish | 6.0 |
Fleurs | Spanish | 4.3 |
Common Voice 20.0 | German | 8.8 |
Fleurs | German | 10.3 |
Common Voice 20.0 | French | 11.5 |
Fleurs | French | 8.1 |
Acknowledgements

Citation
BibTeX:
Please cite the associated Interspeech 2025 paper when using this model (finalized citation pending):
@inproceedings{fong2025speechllmlowres,
  title={Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages},
  author={Fong, Seraphina and Matassoni, Marco and Brutti, Alessio},
  booktitle={Interspeech},
  pages={},
  year={2025},
  note={In press; accepted for Interspeech 2025}
}