---
language:
- en
tags:
- audio-text-to-audio-text
- speech-understanding
- audio
- chat
license: apache-2.0
datasets:
- custom
metrics:
- wer
- bleu
- AIR-Bench
---

# EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

🐈‍⬛ GitHub | 📃 Paper | 🚀 Space | 📊 EchoX-Dialogues | 📊 EchoX-Dialogues-Plus

## Model Description

EchoX is a Speech-to-Speech large language model that addresses the acoustic-semantic gap. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 10k hours of data while delivering state-of-the-art results in knowledge-based question answering and speech interaction tasks.

### Key Features
## Usage

Load the EchoX model and run inference with your audio files as shown in the GitHub repository.

# 📖 Citation

```
@misc{zhang2025echoxmitigatingacousticsemanticgap,
  title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs},
  author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
  year={2025},
  eprint={2509.09174},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.09174},
}
```