LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model
Abstract
LLaSO is an open framework for large-scale speech-language modeling that provides alignment data, instruction-tuning datasets, and evaluation benchmarks to enhance reproducibility and performance.
The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, datasets, pretrained models, and results at https://github.com/EIT-NLP/LLaSO.
Community
We introduce LLaSO, the first fully open, end-to-end stack for large-scale speech–language modeling.
It unifies corpora, a benchmark, and a reference model in one framework:
- LLaSO-Instruct (13.5M) multi-task instruction tuning dataset
- LLaSO-Align (12M) speech–text alignment dataset
- LLaSO-Eval (15K) stratified benchmark
- LLaSO-Base (3.8B) two-stage trained reference model
👉 Code: https://github.com/EIT-NLP/LLaSO
👉 Datasets: https://huggingface.co/datasets?search=LLaSO
👉 Model: https://huggingface.co/YirongSun/LLaSO-Base-3.8B-Instruct
We are currently uploading LLaSO-Instruct and will soon release LLaSO-Align.
Feedback and contributions are very welcome!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment (2025)
- DIFFA: Large Language Diffusion Models Can Listen and Understand (2025)
- Efficient Interleaved Speech Modeling through Knowledge Distillation (2025)
- GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness (2025)
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models (2025)
- DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models (2025)
- WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation (2025)