Model Details
Model Description
A 17.31M parameter multilingual linear projector trained for automatic speech recognition (ASR) using the SLAM-ASR speechLLM framework. Within this framework, only the linear projector was trained alongside a frozen speech encoder (Whisper-large-v3-turbo) and frozen LLM (EuroLLM-1.7B).
- Developed by: SpeechTek Unit at Fondazione Bruno Kessler
- Funded by: This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
- Model type: Linear projector in a speechLLM framework
- Supported Language(s): English, Italian, Spanish, German, French
- License: CC-BY-4.0
Uses
This model is trained for Automatic Speech Recognition (ASR).
How to Get Started with the Model
This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding with the shell scripts provided in the SLAM-ASR codebase; please refer to the instructions there for further details.
Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.
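As a hedged illustration (not part of the official SLAM-ASR scripts), the two frozen base models can be pre-downloaded from the Hugging Face Hub and their local paths passed to the SLAM-ASR shell scripts. The repository ids below are assumptions inferred from the model names in this card.

```python
# Hedged sketch: pre-download the frozen base models from the Hugging Face Hub.
# The repo ids are assumptions inferred from the model names in this card.
from huggingface_hub import snapshot_download

encoder_path = snapshot_download("openai/whisper-large-v3-turbo")
llm_path = snapshot_download("utter-project/EuroLLM-1.7B")
print(encoder_path, llm_path)  # local paths to point the SLAM-ASR scripts at
```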
Training Details
Training Data
The linear projector was trained with a total of 500 hours of data from Common Voice 20.0 and Fleurs, covering 5 languages (English, Italian, Spanish, German, and French). Specifically, the training set consisted of 92.5 hours of Common Voice data + 7.5 hours of Fleurs data per language, while the validation set consisted of 47 minutes of Common Voice data + 47 minutes of Fleurs data per language.
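For clarity, a quick sanity check of the reported data mix (all values taken from the paragraph above):

```python
# Sanity check of the training/validation data composition described above.
languages = ["English", "Italian", "Spanish", "German", "French"]
cv_train_h, fleurs_train_h = 92.5, 7.5        # training hours per language
cv_val_min = fleurs_val_min = 47              # validation minutes per language

train_hours = len(languages) * (cv_train_h + fleurs_train_h)
val_hours = len(languages) * (cv_val_min + fleurs_val_min) / 60

print(train_hours)          # 500.0 hours of training data
print(round(val_hours, 2))  # ~7.83 hours of validation data
```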
Training Procedure
- The model was trained using the codebase provided by the official SLAM-ASR GitHub repository with `torchrun`.
- Only the linear projector was trained.
- The speech encoder (Whisper-large-v3-turbo) and the LLM (EuroLLM-1.7B) were kept frozen (see the sketch after this list).
- No prompt was used during training or inference.
- Training was conducted with one NVIDIA Ada Lovelace L40S GPU.
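A minimal sketch of the freezing setup described in the list above (illustrative only; the actual training is driven by the SLAM-ASR shell scripts, and the model ids are assumptions inferred from this card):

```python
# Illustrative sketch of the setup above: encoder and LLM frozen,
# only the projector's parameters are passed to the optimizer.
import torch
from transformers import AutoModelForCausalLM, WhisperModel

encoder = WhisperModel.from_pretrained("openai/whisper-large-v3-turbo").get_encoder()
llm = AutoModelForCausalLM.from_pretrained("utter-project/EuroLLM-1.7B")

for frozen in (encoder, llm):
    frozen.requires_grad_(False)
    frozen.eval()

# Placeholder single linear layer; see the projector sketch further below.
projector = torch.nn.Linear(1280 * 5, 2048)
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in projector.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # only the projector's weights
```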
Training Hyperparameters
Hyperparameter | Value |
---|---|
llm_name | eurollm-1.7b |
llm_dim | 2048 |
context_length | 4096 |
encoder_name | whisper |
encoder_projector_ds_rate | 5 |
encoder_dim | 1280 |
encoder_projector | linear |
input_type | mel |
mel_size | 128 |
epochs | 6 |
freeze_encoder | true |
freeze_llm | true |
warmup_steps | 1000 |
total_steps | 100000 |
lr | 1e-4 |
validation_interval | 1000 |
batch_size_training | 4 |
val_size_training | 4 |
num_workers_dataloader | 2 |
optimizer | AdamW |
enable_fsdp | false |
enable_ddp | true |
use_fp16 | true |
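As a point of reference, the sketch below shows a projector shape consistent with the hyperparameters above and with the 17.31M parameter figure in the model description: frames are downsampled by concatenating groups of 5 (encoder_projector_ds_rate), then mapped into the LLM embedding space through two linear layers. The 2048-dimensional hidden layer is an assumption inferred from the parameter count, not a value stated in this card.

```python
# Hedged sketch of a projector consistent with the table above and the 17.31M
# figure: concatenate every 5 encoder frames, then Linear -> ReLU -> Linear.
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    def __init__(self, encoder_dim=1280, llm_dim=2048, ds_rate=5, hidden_dim=2048):
        super().__init__()
        self.ds_rate = ds_rate
        self.linear1 = nn.Linear(encoder_dim * ds_rate, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, llm_dim)

    def forward(self, x):  # x: (batch, frames, encoder_dim)
        b, t, d = x.shape
        t = t - t % self.ds_rate  # drop trailing frames that don't fill a group
        x = x[:, :t, :].reshape(b, t // self.ds_rate, d * self.ds_rate)
        return self.linear2(self.relu(self.linear1(x)))

proj = LinearProjector()
print(sum(p.numel() for p in proj.parameters()))  # 17,305,600 ≈ 17.31M
```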
Evaluation
The model was evaluated using the Word Error Rate (WER) metric from the `evaluate` library.
Prior to computing the WER, the ground-truth and predicted transcripts were preprocessed with Whisper's `EnglishTextNormalizer` for English and `BasicTextNormalizer` for all other languages.
Beam search decoding was used with a beam size of 4.
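A minimal sketch of this scoring recipe (not the exact evaluation script): the empty spelling mapping passed to `EnglishTextNormalizer` and the example strings are simplifications for illustration.

```python
# Hedged sketch of the scoring setup described above: WER from the `evaluate`
# library after Whisper-style text normalization.
import evaluate
from transformers.models.whisper.english_normalizer import (
    BasicTextNormalizer,
    EnglishTextNormalizer,
)

wer_metric = evaluate.load("wer")

# An empty spelling mapping is used here for brevity; the full mapping ships
# with the Whisper tokenizer files.
english_norm = EnglishTextNormalizer({})
basic_norm = BasicTextNormalizer()

def score(references, predictions, language):
    norm = english_norm if language == "en" else basic_norm
    return 100 * wer_metric.compute(
        references=[norm(r) for r in references],
        predictions=[norm(p) for p in predictions],
    )

print(score(["hello world"], ["hello word"], language="en"))  # 50.0
```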
Results
Dataset | Language | WER (%) ↓ |
---|---|---|
Common Voice 20.0 | English | 13.5 |
Fleurs | English | 5.5 |
Common Voice 20.0 | Italian | 6.4 |
Fleurs | Italian | 5.8 |
Common Voice 20.0 | Spanish | 6.0 |
Fleurs | Spanish | 4.3 |
Common Voice 20.0 | German | 8.8 |
Fleurs | German | 10.3 |
Common Voice 20.0 | French | 11.5 |
Fleurs | French | 8.1 |
Acknowledgements

Citation
BibTeX:
Please cite the associated Interspeech 2025 paper when using this model (finalized citation pending):
@inproceedings{fong2025speechllmlowres,
  title={Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages},
  author={Fong, Seraphina and Matassoni, Marco and Brutti, Alessio},
  booktitle={Interspeech},
  pages={},
  year={2025},
  note={In press; accepted for Interspeech 2025}
}