Tags: Audio-Text-to-Text · Transformers · Safetensors · qwen2_audio · text2text-generation · Inference Endpoints
frankenliu and GrantL10 committed
Commit de2b10f · verified · 1 parent: df93550

Update README.md (#7)

- Update README.md (2e5d1a49e0b26482a6313139524b43b3c3ffcf6a)

Co-authored-by: Gang Li <[email protected]>

Files changed (1): README.md (+22 −8)
README.md CHANGED
@@ -11,7 +11,7 @@ pipeline_tag: audio-text-to-text

## Introduction

R1-AQA is an audio question answering (AQA) model based on `Qwen2-Audio-7B-Instruct`, optimized through reinforcement learning using the group relative policy optimization (GRPO) algorithm.
This implementation has achieved state-of-the-art performance on the MMAU *Test-mini* benchmark with only 38k post-training samples.
For more details, please refer to our [Github](https://github.com/xiaomi-research/r1-aqa) and [Technical Report](https://arxiv.org/abs/2503.11197).
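As a quick intuition for GRPO: for each question, a group of candidate answers is sampled, each answer receives a simple rule-based reward (e.g., 1 for the correct option, 0 otherwise), and each reward is normalized against the group's mean and standard deviation to form the advantage used in the policy update. The snippet below illustrates only this group-relative normalization step; it is a hypothetical toy example, not the training code from this repository.

```python
# Hypothetical sketch of GRPO's group-relative advantage step
# (for intuition only; not the repository's training code).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled answer's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four answers sampled for one audio question, rewarded 1.0 when the
# predicted option matches the ground truth and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # correct answers get positive advantages
```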
 
 
@@ -23,10 +23,12 @@ Our main findings are as follows:

- Large audio language models (LALMs) still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration.

Additional Notes:

- The AVQA training set originally consists of approximately 40k samples. However, we use only about 38k samples because some data sources have become invalid. Other datasets that rely on YouTube sources, such as AudioSet, face a similar issue. We believe that the missing 2k samples do not have a significant impact on the training results.
- The statement about the 8.2B parameters is based on the *Qwen2-Audio Technical Report*.

### Table: Accuracies (%) on the MMAU Test-mini benchmark

| Model                           | Method              | Sound     | Music | Speech | Average   |
|---------------------------------|---------------------|-----------|-------|--------|-----------|
| \                               | Human\*             | 86.31     | 78.22 | 82.17  | 82.23     |

@@ -39,18 +41,18 @@ The AVQA training set originally consists of approximately 40k samples. However,

| GPT4o + Weak Cap.               | Direct Inference\*  | 39.33     | 41.90 | 58.25  | 45.70     |
| Llama-3-8B-Instruct + Weak Cap. | Direct Inference\*  | 34.23     | 38.02 | 54.05  | 42.10     |
| SALMONN                         | Direct Inference\*  | 41.00     | 34.80 | 25.50  | 33.70     |
| Qwen2-Audio-7B-Instruct         | CoTA \[1\]          | 60.06     | 64.30 | 60.70  | 61.71     |
| Qwen2-Audio-7B-Instruct         | Zero-Shot-CoT \[2\] | 61.86     | 56.29 | 55.26  | 57.80     |
| **Qwen2-Audio-7B-Instruct**     | **GRPO (Ours)**     | **69.37** | 66.77 | 57.36  | **64.50** |

#### Notes

\* The data are sourced from the MMAU official website: [https://sakshi113.github.io/mmau_homepage/](https://sakshi113.github.io/mmau_homepage/)
\[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
\[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
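The accuracies above are computed over multiple-choice questions: a prediction counts as correct only when the selected option matches the ground truth. The snippet below is an illustrative sketch of that kind of scoring; it is a hypothetical helper for intuition, not the official MMAU evaluation script.

```python
# Illustrative accuracy computation for multiple-choice AQA predictions
# (hypothetical helper; not the official MMAU evaluation code).
def choice_accuracy(predictions, references):
    """Return the percentage of predictions that exactly match the reference option."""
    assert len(predictions) == len(references) and references
    correct = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

preds = ["dog barking", "piano", "female speaker"]
refs = ["dog barking", "violin", "female speaker"]
print(f"{choice_accuracy(preds, refs):.2f}%")  # 66.67%
```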
 
 
 
## Inference

```python
import torch
import torchaudio
@@ -87,4 +89,16 @@ generated_ids = generated_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(response)
```
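The hunk above collapses the middle of the inference example (model and processor loading, prompt construction, and generation). As a complement, here is a minimal end-to-end sketch assuming the standard Qwen2-Audio API in `transformers` (`Qwen2AudioForConditionalGeneration` and `AutoProcessor`); the audio path, question text, and generation settings are placeholders, and the full script in the repository README remains authoritative.

```python
# Minimal end-to-end sketch using the standard Qwen2-Audio API in transformers.
# The audio path, question, and generation settings below are placeholders;
# the collapsed portion of the diff holds the model card's authoritative code.
import torch
import torchaudio
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "mispeech/r1-aqa"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Load a local waveform and resample it to the rate the processor expects (16 kHz).
audio_path = "example.wav"  # placeholder path
waveform, sr = torchaudio.load(audio_path)
target_sr = processor.feature_extractor.sampling_rate
waveform = torchaudio.functional.resample(waveform, sr, target_sr).mean(dim=0)

# Build a chat-style prompt containing one audio segment and one question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": audio_path},
            {"type": "text", "text": "Which instrument is playing? Answer with one word."},
        ],
    }
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(
    text=text, audios=[waveform.numpy()], sampling_rate=target_sr,
    return_tensors="pt", padding=True
).to(model.device)

# Generate, then decode only the newly produced tokens, as in the snippet above.
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids = generated_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(response)
```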
## Citation

```bibtex
@article{li2025reinforcement,
  title   = {Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering},
  author  = {Li, Gang and Liu, Jizhong and Dinkel, Heinrich and Niu, Yadong and Zhang, Junbo and Luan, Jian},
  journal = {arXiv preprint arXiv:2503.11197},
  year    = {2025},
  url     = {https://github.com/xiaomi-research/r1-aqa; https://huggingface.co/mispeech/r1-aqa}
}
```