Audio-Text-to-Text · Transformers · Safetensors · qwen2_audio · text2text-generation · Inference Endpoints
frankenliu and GrantL10 committed (verified)
Commit df93550 · 1 Parent(s): d0ac662

Update README.md (#6)


- Update README.md (8c1459b2c0c2384cf288f48ddbec31641a578178)


Co-authored-by: Gang Li <[email protected]>

Files changed (1)
  1. README.md +11 -0
README.md CHANGED
@@ -15,6 +15,17 @@ R1-AQA is an audio question answering (AQA) model based on `Qwen2-Audio-7B-Instruct`
  This implementation has achieved state-of-the-art performance on the MMAU *Test-mini* benchmark with only 38k post-training samples.
  For more details, please refer to our [GitHub](https://github.com/xiaomi-research/r1-aqa) and [Technical Report](https://arxiv.org/abs/2503.11197).
 
+ Our main findings are as follows:
+
+ - The GRPO algorithm can be applied directly and effectively to the audio modality, even to `Qwen2-Audio-7B-Instruct` with only 8.2B parameters (the GRPO objective is sketched below the diff for reference).
+ - With only 38k post-training samples, reinforcement learning outperforms supervised fine-tuning, indicating that RL-based approaches can be effective without large datasets.
+ - The explicit reasoning process has not shown significant benefits for AQA tasks; how to efficiently leverage *deep thinking* or step-by-step reasoning remains an open question for further research.
+ - Large audio language models (LALMs) still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration.
+
+ Additional Notes:
+ The AVQA training set originally consists of approximately 40k samples. However, we use only about 38k samples because some data sources have become invalid. Other datasets that rely on YouTube sources, such as AudioSet, face a similar issue. We believe that the missing 2k samples do not have a significant impact on the training results.
+
+
  ### Table: Accuracies (%) on MMAU Test-mini benchmark
  | Model | Method | Sound | Music | Speech | Average |
  |--------------------------------------------|-------------------------|--------|--------|--------|---------|
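The first finding above refers to GRPO (Group Relative Policy Optimization). What follows is only a rough sketch of the standard formulation from the GRPO literature, written at the sequence level for brevity; the exact objective, reward design, and hyperparameters (the $G$, $\epsilon$, and $\beta$ below) used for R1-AQA are described in the technical report, not here. For each question $q$ (here, an audio clip plus its text question), the policy samples a group of $G$ answers $\{o_1,\dots,o_G\}$, scores them with rewards $\{r_1,\dots,r_G\}$, and normalizes each reward within the group:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}
$$

The policy is then updated with a clipped, KL-regularized objective:

$$
\mathcal{J}(\theta) = \frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i\,\hat{A}_i,\ \operatorname{clip}\big(\rho_i,\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big) - \beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}.
$$

Because the advantage is computed relative to the group rather than from a learned value function, no critic network is required, which is part of why the method can be applied directly to an existing audio-language model such as `Qwen2-Audio-7B-Instruct`.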