Update README.md
README.md CHANGED

@@ -7,58 +7,5 @@ sdk: static
 pinned: false
 ---

-
-
-# Process Reinforcement Through Implicit Rewards
-
-</div>
-
-
-# Links
-
-- [Paper](https://arxiv.org/abs/2502.01456)
-- [Blog](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f)
-- [GitHub](https://github.com/PRIME-RL/PRIME)
-
-# Evaluation
-Through PRIME, we achieve substantial improvements on key reasoning benchmarks over the SFT version of the model: a **16.7%** improvement on average, and over **20%** on the AMC and AIME competitions. Our final model, Eurus-2-7B-PRIME, based on Qwen2.5-Math-7B-Base, surpassed its instruct version on 5 key reasoning benchmarks.
-The final results are presented below:
-
-|               | **Eurus-2-7B-PRIME** | **Eurus-2-7B-SFT** | **Qwen2.5-Math-7B-Instruct** | **Llama-3.1-70B-Instruct** | **GPT-4o** |
-| ------------- | -------------------- | ------------------ | ---------------------------- | -------------------------- | ---------- |
-| AIME 2024     | **26.7 (+23.3)**     | 3.3                | 13.3                         | 16.7                       | 9.3        |
-| MATH-500      | 79.2 (+14.1)         | 65.1               | **79.8**                     | 64.6                       | 76.4       |
-| AMC           | **57.8 (+27.7)**     | 30.1               | 50.6                         | 30.1                       | 45.8       |
-| Minerva Math  | **38.6 (+5.9)**      | 32.7               | 34.6                         | 35.3                       | 36.8       |
-| OlympiadBench | 42.1 (+12.3)         | 29.8               | 40.7                         | 31.9                       | **43.3**   |
-| Avg.          | **48.9 (+16.7)**     | 32.2               | 43.8                         | 35.7                       | 43.3       |
-
-
-We achieved this with only 1/10 of the data and model resources used by Qwen2.5-Math-7B-Instruct:
-
-|            | **Eurus-2-7B-PRIME**         | **Qwen2.5-Math-7B-Instruct**    |
-| ---------- | ---------------------------- | ------------------------------- |
-| Base Model | Qwen2.5-Math-7B              | Qwen2.5-Math-7B                 |
-| SFT Data   | **230K (open-source)**       | 2.5M (open-source and in-house) |
-| RM Data    | **0**                        | 618K (in-house)                 |
-| RM         | **Eurus-2-7B-SFT**           | Qwen2.5-Math-RM (72B)           |
-| RL Data    | **150K queries × 4 samples** | 66K queries × 32 samples        |
-
-# Citation
-If you find PRIME or ImplicitPRM helpful, please cite us.
-
-```
-@article{cui2025process,
-  title={Process reinforcement through implicit rewards},
-  author={Cui, Ganqu and Yuan, Lifan and Wang, Zefan and Wang, Hanbin and Li, Wendi and He, Bingxiang and Fan, Yuchen and Yu, Tianyu and Xu, Qixin and Chen, Weize and others},
-  journal={arXiv preprint arXiv:2502.01456},
-  year={2025}
-}
-```
-
-```
-@article{yuan2024implicitprm,
-  title={Free Process Rewards without Process Labels},
-  author={Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng},
-  journal={arXiv preprint arXiv:2412.01981},
-  year={2024}
-}
-```
+
+Researching scalable reinforcement learning (RL) methods for language models.