Update README.md
README.md CHANGED

@@ -7,58 +7,5 @@ sdk: static
 pinned: false
 ---

-
-
-# Process Reinforcement Through Implicit Rewards
-
-</div>
-
-
-# Links
-
-- [Paper](https://arxiv.org/abs/2502.01456)
-- [Blog](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f)
-- [GitHub](https://github.com/PRIME-RL/PRIME)
-
-# Evaluation
-Through PRIME, we achieve substantial improvements on key reasoning benchmarks over the SFT version of the model: a **16.7%** improvement on average, and over **20%** on the AMC and AIME competitions. Our final model, Eurus-2-7B-PRIME, based on Qwen2.5-Math-7B-Base, surpassed its instruct version on 5 key reasoning benchmarks.
-The final results are presented below:
-
-|               | **Eurus-2-7B-PRIME** | **Eurus-2-7B-SFT** | **Qwen2.5-Math-7B-Instruct** | **Llama-3.1-70B-Instruct** | **GPT-4o** |
-| ------------- | -------------------- | ------------------ | ---------------------------- | -------------------------- | ---------- |
-| AIME 2024     | **26.7 (+23.3)**     | 3.3                | 13.3                         | 16.7                       | 9.3        |
-| MATH-500      | 79.2 (+14.1)         | 65.1               | **79.8**                     | 64.6                       | 76.4       |
-| AMC           | **57.8 (+27.7)**     | 30.1               | 50.6                         | 30.1                       | 45.8       |
-| Minerva Math  | **38.6 (+5.9)**      | 32.7               | 34.6                         | 35.3                       | 36.8       |
-| OlympiadBench | 42.1 (+12.3)         | 29.8               | 40.7                         | 31.9                       | **43.3**   |
-| Avg.          | **48.9 (+16.7)**     | 32.2               | 43.8                         | 35.7                       | 43.3       |
-
-
-We achieved this with only 1/10 of the data and model resources used by Qwen2.5-Math-7B-Instruct:
-
-|            | **Eurus-2-7B-PRIME**         | **Qwen2.5-Math-7B-Instruct**    |
-| ---------- | ---------------------------- | ------------------------------- |
-| Base Model | Qwen2.5-Math-7B              | Qwen2.5-Math-7B                 |
-| SFT Data   | **230K (open-source)**       | 2.5M (open-source and in-house) |
-| RM Data    | **0**                        | 618K (in-house)                 |
-| RM         | **Eurus-2-7B-SFT**           | Qwen2.5-Math-RM (72B)           |
-| RL Data    | **150K queries × 4 samples** | 66K queries × 32 samples        |
-
-# Citation
-If you find PRIME or ImplicitPRM helpful, please cite us.
-
-```
-@article{cui2025process,
-  title={Process reinforcement through implicit rewards},
-  author={Cui, Ganqu and Yuan, Lifan and Wang, Zefan and Wang, Hanbin and Li, Wendi and He, Bingxiang and Fan, Yuchen and Yu, Tianyu and Xu, Qixin and Chen, Weize and others},
-  journal={arXiv preprint arXiv:2502.01456},
-  year={2025}
-}
-```
-
-```
-@article{yuan2024implicitprm,
-  title={Free Process Rewards without Process Labels},
-  author={Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng},
-  journal={arXiv preprint arXiv:2412.01981},
-  year={2024}
-}
-```
+
+Researching scalable reinforcement learning (RL) methods for language models.