stingning committed
Commit 3d64e78 · verified · 1 Parent(s): 0825fd9

Update README.md

Files changed (1)
  1. README.md +1 -54
README.md CHANGED
@@ -7,58 +7,5 @@ sdk: static
  pinned: false
  ---
 
- <div align="center">
-
- # Process Reinforcement Through Implicit Rewards
-
- </div>
-
-
- # Links
-
- - [Paper](https://arxiv.org/abs/2502.01456)
- - [Blog](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f)
- - [GitHub](https://github.com/PRIME-RL/PRIME)
-
- # Evaluation
- Through PRIME, we achieve substantial improvements on key reasoning benchmarks over our SFT version of the model: a **16.7%** average improvement, and over **20%** on the AMC and AIME competitions. Our final model, Eurus-2-7B-PRIME, based on Qwen-2.5-Math-7B-Base, surpasses its instruct version on 5 key reasoning benchmarks.
- The final results are presented below:
- | | **Eurus-2-7B-PRIME** | **Eurus-2-7B-SFT** | **Qwen-2.5-Math-7B-Instruct** | **Llama-3.1-70B-Instruct** | **GPT-4o** |
- | ------------- | -------------------- | ------------------ | ----------------------------- | -------------------------- | ---------- |
- | AIME 2024 | **26.7 (+23.3)** | 3.3 | 13.3 | 16.7 | 9.3 |
- | MATH-500 | 79.2 (+14.1) | 65.1 | **79.8** | 64.6 | 76.4 |
- | AMC | **57.8 (+27.7)** | 30.1 | 50.6 | 30.1 | 45.8 |
- | Minerva Math | **38.6 (+5.9)** | 32.7 | 34.6 | 35.3 | 36.8 |
- | OlympiadBench | 42.1 (+12.3) | 29.8 | 40.7 | 31.9 | **43.3** |
- | Avg. | **48.9 (+16.7)** | 32.2 | 43.8 | 35.7 | 43.3 |
-
-
- We achieved this with only 1/10 of the data and model resources compared with Qwen-Math.
- | | **Eurus-2-7B-PRIME** | **Qwen2.5-Math-7B-Instruct** |
- | ---------- | -------------------------------- | ------------------------------- |
- | Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B |
- | SFT Data | **230K (open-source)** | 2.5M (open-source and in-house) |
- | RM Data | **0** | 618K (in-house) |
- | RM | **Eurus-2-7B-SFT** | Qwen2.5-Math-RM (72B) |
- | RL Data | **150K queries × 4 samples** | 66K queries × 32 samples |
-
- # Citation
- If you find PRIME or ImplicitPRM helpful, please cite us.
-
- ```bibtex
- @article{cui2025process,
-   title={Process reinforcement through implicit rewards},
-   author={Cui, Ganqu and Yuan, Lifan and Wang, Zefan and Wang, Hanbin and Li, Wendi and He, Bingxiang and Fan, Yuchen and Yu, Tianyu and Xu, Qixin and Chen, Weize and others},
-   journal={arXiv preprint arXiv:2502.01456},
-   year={2025}
- }
- ```
-
- ```bibtex
- @article{yuan2024implicitprm,
-   title={Free Process Rewards without Process Labels},
-   author={Yuan, Lifan and Li, Wendi and Chen, Huayu and Cui, Ganqu and Ding, Ning and Zhang, Kaiyan and Zhou, Bowen and Liu, Zhiyuan and Peng, Hao},
-   journal={arXiv preprint arXiv:2412.01981},
-   year={2024}
- }
- ```
+ Researching scalable reinforcement learning (RL) methods on language models.