LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model
Abstract
LLaSO is an open framework for large-scale speech-language modeling that provides alignment data, instruction-tuning datasets, and evaluation benchmarks to enhance reproducibility and performance.
The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, datasets, pretrained models, and results at https://github.com/EIT-NLP/LLaSO.
Community
We introduce LLaSO, the first fully open, end-to-end stack for large-scale speech–language modeling.
It unifies corpora, a benchmark, and a reference model in one framework:
- LLaSO-Instruct (13.5M) multi-task instruction tuning dataset
- LLaSO-Align (12M) speech–text alignment dataset
- LLaSO-Eval (15K) stratified benchmark
- LLaSO-Base (3.8B) two-stage trained reference model
👉 Code: https://github.com/EIT-NLP/LLaSO
👉 Datasets: https://huggingface.co/datasets?search=LLaSO
👉 Model: https://huggingface.co/YirongSun/LLaSO-Base-3.8B-Instruct
We are currently uploading LLaSO-Instruct and will soon release LLaSO-Align.
Feedback and contributions are very welcome!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment (2025)
- DIFFA: Large Language Diffusion Models Can Listen and Understand (2025)
- Efficient Interleaved Speech Modeling through Knowledge Distillation (2025)
- GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness (2025)
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models (2025)
- DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models (2025)
- WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation (2025)