Update README.md (#6)
- Update README.md (8c1459b2c0c2384cf288f48ddbec31641a578178)
Co-authored-by: Gang Li <[email protected]>
README.md
CHANGED
@@ -15,6 +15,17 @@ R1-AQA is an audio question answering (AQA) model based on `Qwen2-Audio-7B-Instruct`
This implementation has achieved state-of-the-art performance on the MMAU *Test-mini* benchmark with only 38k post-training samples.
For more details, please refer to our [GitHub](https://github.com/xiaomi-research/r1-aqa) and [Technical Report](https://arxiv.org/abs/2503.11197).

Our main findings are as follows:

- The GRPO algorithm can be directly and effectively applied to the audio modality, even to `Qwen2-Audio-7B-Instruct` with only 8.2B parameters (a sketch of the GRPO objective follows this list).
- With only 38k post-training samples, reinforcement learning outperforms supervised fine-tuning, indicating that RL-based approaches can be effective without large datasets.
- The explicit reasoning process has not shown significant benefits for AQA tasks; how to efficiently leverage *deep thinking* or step-by-step reasoning remains an open question for further research.
- Large audio language models (LALMs) still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration.
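
As background on the first finding, GRPO samples a group of candidate answers per question, scores each with a scalar reward (e.g., whether the selected option is correct), and normalizes the rewards within the group to form advantages. Below is a minimal sketch in generic GRPO notation; reward design and token-level details are simplified here and not quoted from the technical report:

```latex
% Group-relative advantage (sketch, generic GRPO notation).
% For a question q, sample G answers o_1, ..., o_G from the old policy and
% score each with a scalar reward r_i (e.g., 1 if the chosen option is correct, else 0).
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\ldots,r_G)}{\operatorname{std}(r_1,\ldots,r_G)},
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}

% Clipped surrogate objective with a KL penalty toward a reference policy:
J(\theta) =
\mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G}
  \min\!\left( \rho_i \hat{A}_i,\;
               \operatorname{clip}(\rho_i,\, 1-\varepsilon,\, 1+\varepsilon)\, \hat{A}_i \right)
\right]
- \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right)
```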

Additional Notes:

The AVQA training set originally consists of approximately 40k samples. However, we use only about 38k samples because some data sources have become invalid. Other datasets using YouTube sources face a similar issue, such as AudioSet. We believe that the missing 2k samples do not have a significant impact on the training results.
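
Since R1-AQA keeps the `Qwen2-Audio-7B-Instruct` architecture, inference can follow the standard `transformers` chat-style API for Qwen2-Audio. The sketch below is illustrative only: the checkpoint ID, audio file, and prompt wording are placeholders rather than the exact evaluation setup.

```python
# Minimal AQA inference sketch (assumes a Qwen2-Audio-compatible checkpoint;
# MODEL_ID, example.wav, and the prompt are placeholders, not the exact setup from the report).
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-Audio-7B-Instruct"  # replace with the R1-AQA checkpoint if available

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2AudioForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

# Load audio at the processor's expected sampling rate (16 kHz for Qwen2-Audio).
audio, _ = librosa.load("example.wav", sr=processor.feature_extractor.sampling_rate)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "example.wav"},
            {"type": "text", "text": "Which instrument is playing? Options: (A) piano "
                                     "(B) violin (C) drums (D) flute. Answer with the option letter."},
        ],
    }
]

# Build the chat prompt, pack text + audio, and generate an answer.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True)
inputs = inputs.to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
output_ids = output_ids[:, inputs.input_ids.shape[1]:]  # keep only the newly generated tokens
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

For multiple-choice benchmarks such as MMAU, the predicted option letter is typically parsed from the generated text and compared against the ground truth to compute the accuracies reported below.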

### Table: Accuracies (%) on MMAU Test-mini benchmark

| Model | Method | Sound | Music | Speech | Average |
|-------|--------|-------|-------|--------|---------|