Audio-Text-to-Text · Transformers · Safetensors · qwen2_audio · text2text-generation · Inference Endpoints
frankenliu and GrantL10 committed (verified)
Commit df93550 · 1 Parent(s): d0ac662

Update README.md (#6)


- Update README.md (8c1459b2c0c2384cf288f48ddbec31641a578178)


Co-authored-by: Gang Li <[email protected]>

Files changed (1)
  1. README.md +11 -0
README.md CHANGED
@@ -15,6 +15,17 @@ R1-AQA is an audio question answering (AQA) model based on `Qwen2-Audio-7B-Instruct`
  This implementation has achieved state-of-the-art performance on the MMAU *Test-mini* benchmark with only 38k post-training samples.
  For more details, please refer to our [GitHub](https://github.com/xiaomi-research/r1-aqa) and [Technical Report](https://arxiv.org/abs/2503.11197).
 
+ Our main findings are as follows:
+
+ - The GRPO algorithm can be applied directly and effectively to the audio modality, even to `Qwen2-Audio-7B-Instruct` with only 8.2B parameters (the GRPO objective is sketched below the diff for reference).
+ - With only 38k post-training samples, reinforcement learning outperforms supervised fine-tuning, indicating that RL-based approaches can be effective without large datasets.
+ - The explicit reasoning process has not shown significant benefits for AQA tasks; how to efficiently leverage *deep thinking* or step-by-step reasoning remains an open question for further research.
+ - Large audio language models (LALMs) still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration.
+
+ Additional Notes:
+ The AVQA training set originally consists of approximately 40k samples. However, we use only about 38k samples because some data sources have become invalid. Other datasets that rely on YouTube sources, such as AudioSet, face a similar issue. We believe that the missing 2k samples do not have a significant impact on the training results.
+
+
  ### Table: Accuracies (%) on MMAU Test-mini benchmark
  | Model | Method | Sound | Music | Speech | Average |
  |--------------------------------------------|-------------------------|--------|--------|--------|---------|
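The first finding above refers to GRPO (Group Relative Policy Optimization). What follows is only a rough sketch of the standard formulation from the GRPO literature, written at the sequence level for brevity; the exact objective, reward design, and hyperparameters (the $G$, $\epsilon$, and $\beta$ below) used for R1-AQA are described in the technical report, not here. For each question $q$ (here, an audio clip plus its text question), the policy samples a group of $G$ answers $\{o_1,\dots,o_G\}$, scores them with rewards $\{r_1,\dots,r_G\}$, and normalizes each reward within the group:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}
$$

The policy is then updated with a clipped, KL-regularized objective:

$$
\mathcal{J}(\theta) = \frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i\,\hat{A}_i,\ \operatorname{clip}\big(\rho_i,\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big) - \beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}.
$$

Because the advantage is computed relative to the group rather than from a learned value function, no critic network is required, which is part of why the method can be applied directly to an existing audio-language model such as `Qwen2-Audio-7B-Instruct`.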