---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---
## 🔥 Key Highlights
**State-of-the-Art Performance**
- Outperforms Qwen2.5-Omni-7B and Kimi-Audio-Instruct-7B on **multiple key audio understanding tasks**.
**High Efficiency**
- **3.2×** throughput speedup at comparable batch sizes compared to Qwen2.5-Omni-7B.
- **20×** throughput speedup at larger batch sizes: we tested up to **batch size = 512** for 30 s audio inputs on 80 GB GPUs, while the baselines only support batch size = 8.
- Time-to-first-token (TTFT) speedup of up to **4×** compared to Qwen2.5-Omni-7B.
**Caption-based Alignment**
- Trained with **general audio captions** (instead of ASR transcripts) to achieve holistic audio understanding.
**Full Transparency**
- **Publicly sourced** training data and a reproducible pipeline.
- Apache License 2.0 for **both research and commercial use**.
## Acknowledgment and Model Foundation
Although MiDashengLM demonstrates superior audio understanding performance and efficiency compared to Qwen2.5-Omni models,
we acknowledge **Qwen2.5-Omni as a remarkable and respected foundational work** in the field.
Our model specifically uses the [Qwen2.5-Omni-7B Thinker](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) to initialize its decoder, building upon that model's robust architecture and pretrained weights.
The audio encoder is built upon [Dasheng](https://github.com/XiaoMi/dasheng), an open-source audio encoder for general audio understanding with state-of-the-art performance.
**Dasheng serves as the core foundation enabling MiDashengLM's exceptional performance**.
## Framework
MiDashengLM integrates the powerful Dasheng audio encoder with
the Qwen2.5-Omni-7B Thinker decoder through a unique caption-based alignment strategy.
Unlike conventional ASR-driven approaches,
our model leverages general audio captions to capture comprehensive audio representations encompassing speech, environmental sounds, and musical elements
in a unified textual format. This design enables holistic audio understanding while maintaining exceptional computational efficiency.
### Why Captions Instead of ASR?
ASR Limitations:
- Discards a huge amount of non-speech audio (music/environmental sounds).
- Misses paralinguistic info (speaker emotion, acoustic properties).
- Monotonic alignment provides a trivial learning signal.
Caption Advantages:
- Utilizes all audio content.
- Captures global audio context.
- Non-monotonic alignment provides a harder, more informative learning signal.
### Novel Open Source Dataset for Training: ACAVCaps
ACAVCaps is a meticulously curated 38,662-hour collection of general audio captions derived from the open-source [ACAV100M audio repository](https://acav100m.github.io/).
While leveraging ACAV100M's extensive raw audio materials, we completely re-engineered the annotation process to create a dataset for holistic audio understanding.
We divide the dataset into six categories:
| Category | Example Caption |
|----------|-----------------|
| Pure Speech | "A female voice narrates historical competition with synthetic modulation" |
| Pure Sound | "Outdoor scene with wind, birds, duck quacking and background noise" |
| Pure Music | "Crowd cheering with electronic synthesizer-driven soundscape" |
| Mixed Music | "The audio features a crowd cheering and clapping alongside electronic music with a synthesizer-driven, dark, and energetic soundscape." |
| Mixed Speech | "A Russian voice demonstrates a synthesizer's capabilities over an experimental electronic backdrop, explaining its sound design and value in a gritty, vocal-fry tone." |
| Mixed Sound | "A man speaks in English about entering a city and village, accompanied by the sounds of a running vehicle." |
The figure below illustrates our data curation pipeline for ACAVCaps:
Each caption is generated through a three-step process:
1. **Multi-expert analysis** (speech, vocal, music, acoustics)
2. **LLM reasoning** synthesizing metadata with [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)
3. **Filtering** for audio-text consistency with [Dasheng-GLAP](https://github.com/xiaomi-research/dasheng-glap), as sketched conceptually below
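
Conceptually, the filtering step keeps a caption only when it is acoustically consistent with the clip it describes. The sketch below illustrates the idea with a cosine-similarity threshold over audio and text embeddings; the threshold value is an illustrative placeholder rather than the one used for ACAVCaps, and the embeddings are assumed to come from a joint audio-text model such as Dasheng-GLAP.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def keep_caption(audio_emb: np.ndarray, text_emb: np.ndarray, threshold: float = 0.3) -> bool:
    # Keep the caption only if the audio and text embeddings of the same clip
    # are sufficiently similar (illustrative threshold, not the production value).
    return cosine_similarity(audio_emb, text_emb) >= threshold
```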
We will **release the ACAVCaps dataset** after the ICASSP 2026 review process.
## Usage
### Load Model
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
model_id = "mispeech/midashenglm-7b"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
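
If a CUDA GPU is available, you can optionally load the model in half precision and move it to the GPU to reduce memory use; this is the standard Transformers pattern rather than anything MiDashengLM-specific:

```python
import torch

# Optional: load in bfloat16 and run on GPU for lower memory use and faster inference.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.to("cuda").eval()
```

If you do this, remember to move the processor outputs to the same device before calling `generate`.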
### Construct Prompt
```python
user_prompt = "Caption the audio." # You may try any other prompt
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful language and speech assistant."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
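
If you would rather pass raw samples than a path or URL, one option is to read the file with `soundfile` and supply the waveform via the `"audio"` field. The expected sample rate (commonly 16 kHz mono) is an assumption here; check the processor configuration and resample if necessary.

```python
import soundfile as sf

# Read a local file as a 1-D float array; sr is the file's native sample rate.
waveform, sr = sf.read("/path/to/example.wav")

# Replace the audio entry of the user turn with the raw waveform.
messages[1]["content"][1] = {"type": "audio", "audio": waveform}
```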
### Generate Output
```python
import torch
with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    )
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```
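
By default `generate` uses the model's built-in generation configuration. The standard Transformers arguments can be used to bound the output length or switch decoding strategies, for example:

```python
with torch.no_grad():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=256,  # cap the length of the generated answer
        do_sample=False,     # greedy decoding for reproducible outputs
    )
output = tokenizer.batch_decode(generation, skip_special_tokens=True)
```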
## Results
MiDashengLM delivers solid performance across diverse audio understanding tasks.
### Audio Captioning Results
| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------:|:--------------:|:--------------:|:----------------:|:-------------------:|
| Music | MusicCaps | **59.71** | 43.71 | 35.43 |
| Music | Songdescriber | **45.39** | 45.31 | 44.63 |
| Sound | AudioCaps | **62.18** | 60.79 | 49.00 |
| Sound | ClothoV2 | **49.20** | 47.55 | 48.01 |
| Sound | AutoACD | **66.52** | 55.93 | 44.76 |
*Metrics: FENSE (higher is better).*
### Audio and Paralinguistic Classification
| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------:|:------:|:--------------:|:----------------:|:------------------:|
| VoxCeleb1 | ACC↑ | **92.36** | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | **93.41** | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 96.12 | **99.82** | 99.69 |
| VGGSound | ACC↑ | **52.11** | 0.97 | 2.20 |
| Cochlscene | ACC↑ | **74.06** | 23.88 | 18.34 |
| NSynth | ACC↑ | **80.52** | 60.45 | 38.09 |
| FMA | ACC↑ | 63.73 | **66.77** | 27.91 |
| FSDKaggle2018 | ACC↑ | **75.25** | 31.38 | 24.75 |
| AudioSet | mAP↑ | **8.86** | 6.48 | 3.47 |
| FSD50K | mAP↑ | **37.58** | 23.87 | 27.23 |
### ASR Performance
| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------------------:|:-----------:|:--------------:|:------------:|:-------------------:|
| LibriSpeech test-clean | English | 3.7 | 1.7 | **1.3** |
| LibriSpeech test-other | English | 6.2 | 3.4 | **2.4** |
| People's Speech | English | 27.8 | 28.6 | **22.3** |
| AISHELL2 Mic | Chinese | 3.2 | **2.5** | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | **2.6** | **2.6** |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | **2.6** |
| GigaSpeech2 | Indonesian | **20.8** | 21.2 | >100 |
| GigaSpeech2 | Thai | **36.9** | 53.8 | >100 |
| GigaSpeech2 | Vietnamese | **18.1** | 18.6 | >100 |
*Metrics: WER/CER (lower is better).*
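
For reference, the WER numbers above count word-level substitutions, deletions, and insertions against the reference transcript. A minimal illustration with the `jiwer` package (our actual scoring, including text normalization, is done by `evaluate/wer/compute_wer.py`) looks like this:

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / reference word count
print(jiwer.wer(reference, hypothesis))  # ~0.222 (2 substitutions / 9 words)
```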
### Question Answering Results
| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------------:|:-------:|:------:|:--------------:|:----------------:|:-------------------:|
| MuChoMusic | | ACC↑ | **71.35** | 64.79 | 67.40 |
| MMAU | Sound | ACC↑ | 68.47 | 67.87 | **74.17** |
| MMAU | Music | ACC↑ | 66.77 | **69.16** | 61.08 |
| MMAU | Speech | ACC↑ | **63.66** | 59.76 | 57.66 |
| MMAU | Average | ACC↑ | **66.30** | 65.60 | 64.30 |
| MusicQA | | FENSE↑ | **62.35** | 60.60 | 40.00 |
| AudioCaps-QA | | FENSE↑ | **54.31** | 53.28 | 47.34 |
*Metrics: Higher is better.*
### Reproduction Instructions
To reproduce our results, we provide:
- Prompts ([prompt.csv](evaluate/prompt.csv))
- Evaluation scripts
- Example JSONL files
#### 1. Install Dependencies for Evaluation (not needed for inference)
```bash
pip install -r requirements.txt
```
#### 2. Generate Model Outputs
Generate responses using the model's official framework with prompts from [prompt.csv](evaluate/prompt.csv).
#### 3. Convert Outputs to JSONL Format
Format model outputs using the [example JSONL](evaluate/jsonl) files (a minimal writer sketch follows the table):
| Task | Example File |
|------|--------------|
| Automatic Speech Recognition | [MiDashengLM_LibriSpeech_test-clean.jsonl](evaluate/jsonl/MiDashengLM_LibriSpeech_test-clean.jsonl) |
| Single-target Audio Tagging | [MiDashengLM_NSynth.jsonl](evaluate/jsonl/MiDashengLM_NSynth.jsonl) |
| Gender Recognition | [MiDashengLM_VoxCeleb-Gender.jsonl](evaluate/jsonl/MiDashengLM_VoxCeleb-Gender.jsonl) |
| Multi-target Audio Tagging | [MiDashengLM_FSD50K.jsonl](evaluate/jsonl/MiDashengLM_FSD50K.jsonl) |
| Audio Captioning | [MiDashengLM_AutoACD.jsonl](evaluate/jsonl/MiDashengLM_AutoACD.jsonl) |
| Open Audio Question Answering | [MiDashengLM_MusicQA.jsonl](evaluate/jsonl/MiDashengLM_MusicQA.jsonl) |
| Audio QA with Options | [MiDashengLM_MuChoMusic.jsonl](evaluate/jsonl/MiDashengLM_MuChoMusic.jsonl) |
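
As a sketch of the expected JSONL format, each line is one JSON object per utterance; the field names below follow the `Uses:` comments in the commands of step 4, but please verify against the example JSONL files:

```python
import json

# Hypothetical ASR results: for WER evaluation the scripts read
# "lang", "text" (reference transcript), and "model_output" (hypothesis).
results = [
    {"lang": "en", "text": "hello world", "model_output": "hello world"},
    {"lang": "en", "text": "good morning", "model_output": "good mourning"},
]

with open("MyModel_LibriSpeech_test-clean.jsonl", "w", encoding="utf-8") as f:
    for row in results:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```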
#### 4. Evaluate Results
Execute the corresponding evaluation scripts:
```bash
# Automatic Speech Recognition (WER)
# Uses: lang, text, model_output
python evaluate/wer/compute_wer.py -i evaluate/jsonl/MiDashengLM_LibriSpeech_test-clean.jsonl
# Single-target Audio Tagging (ACC)
# Uses: label, model_output
python evaluate/compute_at_acc.py -i evaluate/jsonl/MiDashengLM_NSynth.jsonl
# Gender Recognition (ACC)
# Uses: label, model_output
python evaluate/compute_gender_acc.py -i evaluate/jsonl/MiDashengLM_VoxCeleb-Gender.jsonl
# Multi-target Audio Tagging (mAP)
# Uses: dataset_name, label, model_output, model_name
python evaluate/compute_map.py -i evaluate/jsonl/MiDashengLM_FSD50K.jsonl
# Audio Captioning (FENSE)
# Uses: audio, text, model_output
python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_AutoACD.jsonl
# Open Audio QA (FENSE)
# Uses: audio, answer, model_output
python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_MusicQA.jsonl
# Audio QA with Options (ACC)
# Uses: answer, model_output
python evaluate/compute_qa_acc.py -i evaluate/jsonl/MiDashengLM_MuChoMusic.jsonl
```
#### 5. Evaluate on MECAT and MMAU benchmarks
Please refer to the official repositories for evaluation on the [MECAT](https://github.com/xiaomi-research/mecat)
and [MMAU](https://github.com/Sakshi113/mmau) benchmarks.
## Efficiency
MiDashengLM demonstrates superior inference efficiency compared to Qwen2.5-Omni-7B,
achieving a 3.2× speedup at comparable batch sizes and an overall potential speedup of 20.2× with larger batches.
| Batch Size | MiDashengLM (samples/s) | Qwen2.5-Omni-7B (samples/s) | Speedup |
|:----------:|:-----------------------:|:----------------------------:|:-------:|
| 1 | 0.45 | 0.36 | 1.25× |
| 4 | 1.40 | 0.91 | 1.53× |
| 8 | 2.72 | 1.15 | 2.36× |
| 16 | 5.18 | OOM | - |
| 32 | 9.78 | OOM | - |
| 64 | 17.07 | OOM | - |
| 128 | 22.73 | OOM | - |
| 200 | 25.15 | OOM | - |
*Tested on 80GB GPU with 30s audio, 100-token output.*
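
As a rough way to reproduce this kind of measurement, the sketch below times repeated batched `generate` calls on pre-processed inputs and reports samples per second. It assumes a CUDA device and a `model_inputs` batch built as in the Usage section; it is a simplified stand-in for our benchmarking setup, not the exact harness.

```python
import time
import torch

def throughput(model, model_inputs, batch_size: int, n_runs: int = 5) -> float:
    """Samples per second over n_runs batched generate calls (CUDA assumed)."""
    # Warm-up run so kernel compilation and cache allocation do not skew timing.
    with torch.no_grad():
        model.generate(**model_inputs, max_new_tokens=100)
    torch.cuda.synchronize()

    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(n_runs):
            model.generate(**model_inputs, max_new_tokens=100)
    torch.cuda.synchronize()

    return n_runs * batch_size / (time.perf_counter() - start)
```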
## Training Data
MiDashengLM is trained exclusively on publicly available datasets across five categories: Speech, Sound and General Audio, Speech and Paralinguistic, Music, and Question Answering. All datasets are listed below with their respective tasks, lengths, and supervised fine-tuning (SFT) usage.
### Speech Training Data
This table lists speech-related datasets used for tasks like Automatic Speech Recognition (ASR), keyword spotting (KWS), and speech-to-text translation (S2TT).
The column βSFT?β indicates whether the dataset is used for supervised fine-tuning.
| Data | Task | Length(h) | SFT? |
|:----------------------:|:---------:|:---------:|:----:|
| LibriSpeech | ASR | 960 | ✓ |
| LibriHeavy | ASR | 50,000 | X |
| GigaSpeech | ASR | 10,000 | ✓ |
| GigaSpeech2 | ASR | 30,000 | ✓ |
| WeNetSpeech | ASR | 10,000 | ✓ |
| Yodas | ASR | 320,000 | X |
| CommonVoice-17.0 | ASR | 5,000 | ✓ |
| AISHELL-1 | ASR | 100 | ✓ |
| AISHELL-2 | ASR | 1,000 | ✓ |
| AISHELL-3 | ASR | 70 | ✓ |
| LJSpeech-1.1 | ASR | 37 | X |
| LibriTTS | ASR | 585 | X |
| MultiLingualSpokenWords | KWS | 5,000 | X |
| Emilia | ASR | 101,000 | ✓ |
| CovoST-v2 | S2TT | 2,880 | ✓ |
| Fleurs | S2TT | 1,224 | X |
| MSR-86K | ASR, LangID | 86,000 | ✓ |
| ACAV100M-Speech | ASR | 55,754 | X |
| Must-C | ASR, S2TT | 1,000 | ✓ |
| MLS | ASR | 50,000 | X |
| SpgiSpeech | ASR | 5,000 | X |
| PeoplesSpeech | ASR | 30,000 | X |
| KeSpeech | ASR | 1,400 | ✓ |
| LAION-300M | Caption | 230,000 | X |
| **Total** | | **997,010** | **258,410** |
### Sound and General Audio Datasets
| Dataset | Task | Length(h) | SFT? |
|:--------------:|:------------------------:|:---------:|:----:|
| FSD50k | Sound Event | 77 | ✓ |
| AudioSet | Sound Event | 5,200 | ✓ |
| AudioSet-strong | Sound Event | 220 | X |
| VGGSound | Sound Event | 540 | ✓ |
| FSDKaggle2018 | Sound Event | 20 | ✓ |
| FSDKaggle2019 | Sound Event | 100 | ✓ |
| ARCA23k | Sound Event | 120 | X |
| AutoACD | Audio (Sound) Caption | 5,200 | ✓ |
| AudioSetCaps | Audio (Sound) Caption | 6,000 | ✓ |
| SoundVECaps | Audio (Sound) Caption | 5,000 | ✓ |
| WavCaps | Audio (Sound) Caption | 7,567 | ✓ |
| Audiocaps | Audio (Sound) Caption | 100 | ✓ |
| Clothov2 | Audio (Sound) Caption | 17 | ✓ |
| TACOS | Audio (Sound) Caption | 98 | ✓ |
| CochlScene | SoundScape | 500 | ✓ |
| BirdSet | SoundScape | 7,000 | X |
| ACAVCaps | General Caption | 38,662 | ✓ |
| **Total** | | **76,421** | **69,081** |
### Speech and Paralinguistic Datasets
| Dataset | Task | Length(hours) | SFT? |
|:------------------:|:-----------------------------:|:-------------:|:----:|
| IEMOCAP | Emotion | 8 | ✓ |
| Meld | Emotion | 12 | ✓ |
| SUBESCO | Emotion | 9 | X |
| RAVDESS-Speech | Emotion | 2 | X |
| RAVDESS-Song | Emotion | 1 | X |
| CREMA-D | Emotion | 4 | X |
| ESD | Emotion | 29 | X |
| VocalSound | Vocal sound classification | 20 | ✓ |
| NonSpeech7k | Vocal sound classification | 3 | ✓ |
| VoxLingua107 | Language identification | 7,200 | ✓ |
| CommonLanguage | Language identification | 45 | ✓ |
| YLACombe | Language identification | 5 | X |
| VoxCeleb1 | Speaker verification | 76 | ✓ |
| CNCeleb | Speaker verification & age | 2,100 | ✓ |
| VoxCeleb2 | Speaker verification | 1,000 | ✓ |
| VoxBlink1 | Speaker verification | 1,300 | ✓ |
| VoxBlink2 | Speaker verification | 2,600 | ✓ |
| VoxTube | Language identification | 5,200 | ✓ |
| LibriCount | Speaker counting | 8 | ✓ |
| FluentSpeechCommands | Intent classification & gender | 17 | X |
| SpeechOcean762 | Speaker age | 5 | X |
| ASVSpoof5 | Spoof detection | 603 | X |
| **Total** | | **20,247** | **19,572** |
### Music-Related Datasets
Covers music captioning, genre recognition, instrument classification, and singing style identification.
| Dataset | Task | Length(h) | SFT? |
|:---------------:|:---------------------------------:|:---------:|:----:|
| MusicCaps | Music Caption | 15 | ✓ |
| Songdescriber | Music Caption | 23 | ✓ |
| LPMusicCaps-MTT | Music Caption | 18 | ✓ |
| LPMusicCaps-MSD | Music Caption | 1,000 | ✓ |
| VocalSet | Singing style identification | 10 | X |
| FreeMusicArchive | Genre recognition | 610 | ✓ |
| MTG-Jamendo | Instrument classification, Genre recognition | 3,768 | ✓ |
| NSynth | Instrument classification | 360 | ✓ |
| GoodSounds | Instrument classification | 28 | ✓ |
| chMusic | Instrument classification | 1 | ✓ |
| CTIS | Instrument classification | 1 | ✓ |
| **Total** | | **5,824** | **5,814** |
### Question Answering Datasets
Used for training on audio-visual QA, environment QA, and music QA tasks. Most support SFT.
| Dataset | Task | # QA | SFT? |
|:---------:|:---------------:|:--------:|:----:|
| AVQA | Environment QA | 36,114 | ✓ |
| ClothoAQA | Environment QA | 6,175 | ✓ |
| TACOS+ | Environment QA | 40,019 | ✓ |
| MusicQA | Music QA | 112,878 | ✓ |
| SIFT-50M | Speech QA | 21,430,000 | ✓ |
| ACAV-QA | General QA | 24,371 | ✓ |
## Citation
MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.
If you find MiDashengLM useful in your research, please consider citing our work:
```bibtex
@techreport{midashenglm7b,
title = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
author = {{Horizon Team, MiLM Plus}},
institution= {Xiaomi Inc.},
year = {2025},
note = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
url = {https://arxiv.org/abs/2508.03983},
eprint = {2508.03983},
}
```