Audio-Reasoner
We implement inference-time scaling in Audio-Reasoner, a large audio language model that performs deep thinking and structured chain-of-thought (CoT) reasoning for multimodal understanding. To achieve this, we constructed CoTA, a high-quality dataset of 1.2M reasoning-rich samples built with structured CoT techniques. Audio-Reasoner achieves state-of-the-art results on the MMAU-mini (+25.42%) and AIR-Bench-Chat (+14.57%) benchmarks.
Audio-Reasoner-7B 🤗 | CoTA Dataset 🤗 (coming soon)
Paper 📑 | Wechat 💭 | Code ⚙️
Demo • Install • Quick Start • FAQ • Contact us
If you like our work, please give us a star ⭐!
Main Results
News and Updates
- 2025.03.05: ✅Audio-Reasoner-7B checkpoint is released on HuggingFace🤗 !
- 2025.03.05: ✅Audio-Reasoner Paper is uploaded to arXiv 📑.
- 2025.03.04: ✅Demos, inference code and evaluation results have been released.
- 2025.03.04: ✅Created this repo.
Roadmap
2025.03: 🔜Upload CoTA dataset to HuggingFace🤗.
2025.04: 🔜Open-source the data synthesis pipeline and training code.
Demo
Features
✅ Audio-Reasoner enables deep reasoning and inference scaling in audio-based tasks, built on Qwen2-Audio-Instruct with structured CoT training.
✅ CoTA offers 1.2M high-quality captions and QA pairs across domains for structured reasoning and enhanced pretraining.
✅ The pretrained model and dataset cover various types of audio, including sound, music, and speech, and achieve state-of-the-art results across multiple benchmarks. Refer to our paper for details.
Install
Clone and install
- Clone the repo
git clone https://github.com/xzf-thu/Audio-Reasoner.git
cd Audio-Reasoner
- Install the required packages
conda create -n Audio-Reasoner python=3.10
conda activate Audio-Reasoner
pip install -r requirements.txt
pip install transformers==4.49.1
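After installing, a quick sanity check (a minimal sketch using only the standard library) confirms that the pinned versions resolved correctly; the distribution names below assume ms-swift is pulled in via requirements.txt:
from importlib.metadata import version

# Verify the versions actually resolved in the Audio-Reasoner environment.
print('transformers:', version('transformers'))  # expected: 4.49.1
print('ms-swift:', version('ms-swift'))          # assumed to be installed by requirements.txt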
Quick Start
Chat using ms-swift
from swift.llm import InferEngine, InferRequest, PtEngine, RequestConfig
from swift.plugin import InferStats

# System prompt that elicits the structured <THINK>/<RESPONSE> output format.
system = ('You are an audio deep-thinking model. Upon receiving a question, '
          'please respond in two parts: <THINK> and <RESPONSE>. The <THINK> section '
          'should be further divided into four parts: <PLANNING>, <CAPTION>, '
          '<REASONING>, and <SUMMARY>.')


def infer_stream(engine: 'InferEngine', infer_request: 'InferRequest'):
    request_config = RequestConfig(max_tokens=2048, temperature=0, stream=True)
    metric = InferStats()
    gen = engine.infer([infer_request], request_config, metrics=[metric])
    # messages[0] is the system prompt; the user query is the second message.
    query = infer_request.messages[1]['content']
    output = ""
    print(f'query: {query}\nresponse: ', end='')
    for resp_list in gen:
        if resp_list[0] is None:
            continue
        print(resp_list[0].choices[0].delta.content, end='', flush=True)
        output += resp_list[0].choices[0].delta.content
    print()
    print(f'metric: {metric.compute()}')
    return output


def get_message(audiopath, prompt):
    # Build a two-turn conversation: the system prompt, then a user turn
    # carrying both the audio file and the text question.
    messages = [
        {'role': 'system', 'content': system},
        {
            'role': 'user',
            'content': [
                {'type': 'audio', 'audio': audiopath},
                {'type': 'text', 'text': prompt},
            ],
        },
    ]
    return messages


model = 'qwen2_audio'
last_model_checkpoint = ""  # Please replace this with the path to the checkpoint
engine = PtEngine(last_model_checkpoint, max_batch_size=64, model_type=model)


def audioreasoner_gen(audiopath, prompt):
    return infer_stream(engine, InferRequest(messages=get_message(audiopath, prompt)))


def main():
    # Please replace this with your test audio
    audiopath = "assets/test.wav"
    # Please replace this with your question about the test audio
    prompt = "Which of the following best describes the rhythmic feel and time signature of the song?"
    audioreasoner_gen(audiopath, prompt)


if __name__ == '__main__':
    main()
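Audio-Reasoner's reply follows the structured format requested by the system prompt: a <THINK> block containing <PLANNING>, <CAPTION>, <REASONING>, and <SUMMARY>, followed by the final <RESPONSE>. Below is a minimal sketch of a hypothetical parse_sections helper for splitting the returned string into its sections; it assumes the model closes each tag (e.g. </RESPONSE>), so adapt the pattern if your outputs use opening tags only.
import re

def parse_sections(output: str) -> dict:
    """Hypothetical helper: extract each tagged section from a structured reply.

    Assumes well-formed <TAG>...</TAG> pairs; sections the model omits
    simply do not appear in the returned dict.
    """
    sections = {}
    for tag in ('PLANNING', 'CAPTION', 'REASONING', 'SUMMARY', 'RESPONSE'):
        match = re.search(rf'<{tag}>(.*?)</{tag}>', output, re.DOTALL)
        if match:
            sections[tag] = match.group(1).strip()
    return sections

# For example, to show only the final answer and hide the reasoning trace:
# output = audioreasoner_gen("assets/test.wav", "What instrument opens the piece?")
# print(parse_sections(output).get('RESPONSE', output))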
Local test
conda activate Audio-Reasoner
cd Audio-Reasoner
# Test-run the preset audio samples and questions
python inference.py
FAQ
1. What kinds of audio can Audio-Reasoner understand, and what kind of thinking does it perform?
Audio-Reasoner can understand various types of audio, including sound, music, and speech. It conducts in-depth thinking in four parts: planning, caption, reasoning, and summary.
2. Why is transformers installed after ms-swift in the environment configuration?
The transformers version has a significant impact on model performance; we have verified that transformers==4.49.1 is one of the suitable versions. Installing ms-swift first gives the subsequent transformers install a stable base and avoids potential version conflicts that could affect the model's performance.
Contact
If you have any questions, please feel free to contact us via [email protected].
Citation
Please cite our paper if you find our model and dataset useful. Thanks!
@misc{xie2025audioreasonerimprovingreasoningcapability,
      title={Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models},
      author={Zhifei Xie and Mingbao Lin and Zihang Liu and Pengcheng Wu and Shuicheng Yan and Chunyan Miao},
      year={2025},
      eprint={2503.02318},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2503.02318},
}