Model Details

Model Description

A 17.31M-parameter multilingual linear projector trained for automatic speech recognition (ASR) using the SLAM-ASR speechLLM framework. Within this framework, only the linear projector was trained, while the speech encoder (Whisper-large-v3-turbo) and the LLM (EuroLLM-1.7B) were kept frozen.

  • Developed by: SpeechTek Unit at Fondazione Bruno Kessler
  • Funded by: This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
  • Model type: Linear projector in a speechLLM framework
  • Supported Language(s): English, Italian, Spanish, German, French
  • License: CC-BY-4.0
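
For reference, SLAM-ASR's linear projector concatenates groups of k consecutive encoder frames (k = 5 here, the encoder_projector_ds_rate listed under Training Hyperparameters) and maps them into the LLM embedding space with two linear layers. The PyTorch sketch below illustrates that design; the 2048-wide hidden layer and the ReLU are assumptions based on the SLAM-ASR reference implementation, and they yield roughly 17.31M parameters, consistent with the size stated above.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """SLAM-ASR-style linear projector (sketch): concatenate k consecutive
    encoder frames (temporal downsampling by k), then project into the
    LLM embedding space."""

    def __init__(self, encoder_dim=1280, llm_dim=2048, ds_rate=5, hidden_dim=2048):
        super().__init__()
        self.k = ds_rate
        self.linear1 = nn.Linear(encoder_dim * ds_rate, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, llm_dim)

    def forward(self, x):
        # x: (batch, time, encoder_dim) hidden states from the speech encoder.
        b, t, d = x.size()
        t = t - (t % self.k)  # drop trailing frames that do not fill a group
        x = x[:, :t, :].reshape(b, t // self.k, d * self.k)
        return self.linear2(self.relu(self.linear1(x)))

proj = LinearProjector()
print(sum(p.numel() for p in proj.parameters()))  # 17,305,600 ≈ 17.31M
```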

Uses

This model is trained for Automatic Speech Recognition (ASR).

How to Get Started with the Model

This linear projector checkpoint can be downloaded and used for further fine-tuning, or for decoding, with the shell scripts provided in the SLAM-ASR codebase. Refer to the instructions there for details.

Both Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.
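
The sketch below shows, outside the SLAM-ASR scripts, how the three pieces fit together at inference time: mel features go through the frozen Whisper encoder, the projector maps the encoder states into EuroLLM's embedding space, and the LLM decodes the transcript. The Hub IDs, the checkpoint filename, and the state-dict layout are assumptions (the released checkpoint may use different module names); the supported path remains the SLAM-ASR shell scripts.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          WhisperFeatureExtractor, WhisperModel)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen components (Hub IDs assumed; download these first).
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3-turbo")
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3-turbo").encoder.to(device).eval()
tokenizer = AutoTokenizer.from_pretrained("utter-project/EuroLLM-1.7B")
llm = AutoModelForCausalLM.from_pretrained("utter-project/EuroLLM-1.7B").to(device).eval()

# Projector checkpoint (filename hypothetical; LinearProjector is the sketch
# from "Model Description" above and may not match the released key names).
projector = LinearProjector()
projector.load_state_dict(torch.load("model.pt", map_location=device))
projector = projector.to(device).eval()

def transcribe(audio):
    # audio: 1-D float waveform sampled at 16 kHz.
    feats = feature_extractor(audio, sampling_rate=16000,
                              return_tensors="pt").input_features.to(device)
    with torch.no_grad():
        speech = encoder(feats).last_hidden_state    # (1, 1500, 1280)
        speech_embeds = projector(speech)            # (1, 300, 2048)
        # No text prompt was used with this model, so the projected speech
        # embeddings are fed to the LLM directly.
        out = llm.generate(inputs_embeds=speech_embeds,
                           max_new_tokens=256, num_beams=4)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```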

Training Details

Training Data

The linear projector was trained with a total of 500 hours of data from Common Voice 20.0 and Fleurs, covering 5 languages (English, Italian, Spanish, German, and French). Specifically, the training set consisted of 92.5 hours of Common Voice data + 7.5 hours of Fleurs data per language, while the validation set consisted of 47 minutes of Common Voice data + 47 minutes of Fleurs data per language.

Training Procedure

  • The model was trained with torchrun, using the codebase provided by the official SLAM-ASR GitHub repository.
  • Only the linear projector was trained (see the sketch after the hyperparameter table below).
  • The speech encoder (Whisper-large-v3-turbo) and the LLM (EuroLLM-1.7B) were kept frozen.
  • No prompt was used during training or inference.
  • Training was conducted on one NVIDIA Ada Lovelace L40S GPU.

Training Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| llm_name | eurollm-1.7b |
| llm_dim | 2048 |
| context_length | 4096 |
| encoder_name | whisper |
| encoder_projector_ds_rate | 5 |
| encoder_dim | 1280 |
| encoder_projector | linear |
| input_type | mel |
| mel_size | 128 |
| epochs | 6 |
| freeze_encoder | true |
| freeze_llm | true |
| warmup_steps | 1000 |
| total_steps | 100000 |
| lr | 1e-4 |
| validation_interval | 1000 |
| batch_size_training | 4 |
| val_size_training | 4 |
| num_workers_dataloader | 2 |
| optimizer | AdamW |
| enable_fsdp | false |
| enable_ddp | true |
| use_fp16 | true |
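
To make the freezing setup concrete, here is a minimal PyTorch sketch of how the trainable parameters and optimizer from the table above might be wired, reusing the encoder, llm, and projector objects from the sketches earlier in this card; the learning-rate shape after warmup is an assumption.

```python
import torch

# Freeze the speech encoder and the LLM; only the projector gets gradients.
for p in encoder.parameters():
    p.requires_grad = False
for p in llm.parameters():
    p.requires_grad = False

# AdamW over the projector parameters only (lr from the table above).
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

# Linear warmup over the first 1,000 of 100,000 steps, then a constant rate
# (the post-warmup schedule is assumed, not stated in this card).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min((step + 1) / 1000, 1.0)
)
```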

Evaluation

The model was evaluated using the Word Error Rate (WER) metric from the evaluate library. Before computing the WER, ground-truth and predicted transcripts were normalized with the Whisper EnglishTextNormalizer for English and the BasicTextNormalizer for all other languages. Beam search decoding was used with a beam size of 4.
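
A minimal sketch of this scoring step, assuming the normalizers from the openai-whisper package (the card does not say which implementation was used; transformers ships equivalent classes) and a hypothetical score helper:

```python
import evaluate
from whisper.normalizers import BasicTextNormalizer, EnglishTextNormalizer

wer_metric = evaluate.load("wer")

def score(references, predictions, language):
    # EnglishTextNormalizer for English, BasicTextNormalizer for the rest.
    normalize = EnglishTextNormalizer() if language == "en" else BasicTextNormalizer()
    refs = [normalize(r) for r in references]
    hyps = [normalize(p) for p in predictions]
    return 100 * wer_metric.compute(references=refs, predictions=hyps)

print(score(["hello world"], ["hello word"], language="en"))  # 50.0
```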

Results

| Dataset | Language | WER (%) ↓ |
| --- | --- | --- |
| Common Voice 20.0 | English | 13.5 |
| Fleurs | English | 5.5 |
| Common Voice 20.0 | Italian | 6.4 |
| Fleurs | Italian | 5.8 |
| Common Voice 20.0 | Spanish | 6.0 |
| Fleurs | Spanish | 4.3 |
| Common Voice 20.0 | German | 8.8 |
| Fleurs | German | 10.3 |
| Common Voice 20.0 | French | 11.5 |
| Fleurs | French | 8.1 |

Acknowledgements

This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).

Citation

BibTeX:

Please cite the associated Interspeech 2025 paper when using this model (finalized citation pending):

@inproceedings{fong2025speechllmlowres,
  title={Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages},
  author={Fong, Seraphina and Matassoni, Marco and Brutti, Alessio},
  booktitle={Interspeech},
  pages={},
  year={2025},
  note={In press; accepted for Interspeech 2025}
}