Eurus-2-7B-PRIME-Zero
Links
- Paper
- Blog
- 🤗 PRIME Collection
- 🤗 RL Data
Introduction
Eurus-2-7B-PRIME-Zero is trained with the PRIME (Process Reinforcement through IMplicit rEwards) method, an open-source approach to online reinforcement learning (RL) with process rewards, to advance the reasoning abilities of language models beyond imitation or distillation. It starts from Qwen2.5-Math-7B-Base. RL from the base model converges much faster than RL from the SFT model, surpassing the instruct version within 32 training steps.

In PRIME, the policy model and the PRM are both initialized from the SFT model. In each RL iteration, the policy model first generates rollouts. The implicit PRM and an outcome verifier then score the rollouts, and the implicit PRM is updated on the rollouts with the outcome reward. Finally, the outcome reward and the process reward are combined and used to update the policy model.
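The process reward here is the token-level implicit reward of the implicit-PRM paper cited below: a scaled log-likelihood ratio between the implicit PRM and the reference (SFT) model. Below is a minimal, self-contained sketch of this scoring step; the function name, the value of β, and the toy log-probabilities are illustrative, not taken from the released code.

```python
import numpy as np

def implicit_process_rewards(prm_token_logprobs, ref_token_logprobs, beta=0.05):
    """Per-token implicit process reward:
    beta * (log pi_PRM(y_t | y_<t) - log pi_ref(y_t | y_<t)).
    beta=0.05 is illustrative, not the training value."""
    return beta * (np.asarray(prm_token_logprobs) - np.asarray(ref_token_logprobs))

# Toy log-probabilities for a 5-token response under the PRM and the reference model.
prm_lp = np.array([-1.2, -0.8, -2.0, -0.5, -1.0])
ref_lp = np.array([-1.5, -0.9, -1.8, -0.7, -1.4])
print(implicit_process_rewards(prm_lp, ref_lp))  # positive where the PRM prefers the token
```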
The PRIME algorithm flow includes the following steps:
1. Prompt filtering based on policy model performance, preserving only prompts on which the policy model achieves an accuracy between 0.2 and 0.8.
2. Calculate the implicit process reward $r^t$ for every token of each rollout.
3. Update the implicit PRM with the predicted implicit process rewards $r^t$ and the ground-truth outcome label $r$.
4. Advantage estimation with RLOO. Specifically, we first calculate the returns of outcome rewards and implicit process rewards separately:
   - For ground-truth outcome rewards, we directly adopt RLOO without any modification.
   - For implicit process rewards, we use a three-step process to calculate the return: (1) use the averaged implicit process rewards to compute the leave-one-out baseline; (2) normalize the process reward at step $t$ by subtracting the baseline; (3) calculate the discounted return for each response.
   - Finally, the advantage is set to the combination of both returns (a NumPy sketch follows this list).
5. Update the policy using the PPO loss for legitimate importance sampling.
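A self-contained sketch of steps 1 and 4 is shown below, using toy NumPy data. It mirrors the description above (leave-one-out baselines per prompt, then a discounted return over steps), but the discount factor, the equal weighting of the two returns, and all function names are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def filter_prompts(accuracies, low=0.2, high=0.8):
    """Step 1: keep prompts whose rollout accuracy under the current policy lies in [low, high]."""
    return [i for i, acc in enumerate(accuracies) if low <= acc <= high]

def rloo_outcome_returns(outcome_rewards):
    """Step 4a: RLOO on ground-truth outcome rewards. Each of the K rollouts of a prompt
    gets its reward minus the mean reward of the other K-1 rollouts."""
    r = np.asarray(outcome_rewards, dtype=float)
    baseline = (r.sum() - r) / (len(r) - 1)  # leave-one-out baseline
    return r - baseline

def implicit_process_returns(process_rewards, gamma=1.0):
    """Step 4b: (1) leave-one-out baseline from per-rollout averaged process rewards,
    (2) subtract the baseline at every step, (3) discounted return-to-go per response."""
    means = np.array([np.mean(pr) for pr in process_rewards])
    baselines = (means.sum() - means) / (len(means) - 1)
    returns = []
    for pr, b in zip(process_rewards, baselines):
        centered = np.asarray(pr, dtype=float) - b
        g = np.zeros_like(centered)
        running = 0.0
        for t in reversed(range(len(centered))):   # discounted return-to-go
            running = centered[t] + gamma * running
            g[t] = running
        returns.append(g)
    return returns

# Toy example: one prompt with K = 4 rollouts.
outcome = [1.0, 0.0, 0.0, 1.0]                                                # verifier labels
process = [np.random.randn(np.random.randint(3, 6)) * 0.1 for _ in range(4)]  # per-token implicit rewards
adv_out = rloo_outcome_returns(outcome)
adv_proc = implicit_process_returns(process)
# Final advantage = combination of both returns (equal weighting here is an assumption).
advantages = [adv_out[i] + adv_proc[i] for i in range(4)]
print(filter_prompts([0.1, 0.5, 0.9]))  # -> [1]
print(advantages)
```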
Usage
We apply tailored prompts for coding and math tasks:
Coding
{question} + "\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end."
Math
{question} + "\n\nPresent the answer in LaTex format: \\boxed{Your answer}"
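For convenience, the two templates can be wrapped in small helpers like the following; the helper names are ours, only the template strings come from above.

````python
def make_prompt_coding(question: str) -> str:
    # Coding template from above.
    return question + "\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end."

def make_prompt_math(question: str) -> str:
    # Math template from above.
    return question + "\n\nPresent the answer in LaTex format: \\boxed{Your answer}"
````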
```python
import os

from tqdm import tqdm
import torch
from vllm import LLM, SamplingParams

os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "true"


def generate(question_list, model_path):
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        tensor_parallel_size=torch.cuda.device_count(),
        gpu_memory_utilization=0.90,
    )
    sampling_params = SamplingParams(max_tokens=8192,
                                     temperature=0.0,
                                     n=1)
    outputs = llm.generate(question_list, sampling_params, use_tqdm=True)
    completions = [[output.text for output in output_item.outputs] for output_item in outputs]
    return completions


def make_conv_zero(question):
    question = question + "\n\nPresent the answer in LaTex format: \\boxed{Your answer}"
    content = f"A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: {question}. Assistant:"
    print(content)
    return content


def run():
    model_path = "PRIME-RL/Eurus-2-7B-PRIME-Zero"
    all_problems = [
        "which number is larger? 9.11 or 9.9?"
    ]
    completions = generate([make_conv_zero(problem_data) for problem_data in all_problems], model_path)
    print(completions)
    # Sample output:
    # " Let's compare the numbers 9.11 and 9.9.\n\nTo determine which number is larger, we can compare them digit by digit from left to right.\n\nFirst, let's look at the digits in the ones place:\n- Both numbers have a 9 in the ones place.\n\nNext, let's look at the digits in the tenths place:\n- The number 9.11 has a 1 in the tenths place.\n- The number 9.9 has a 9 in the tenths place.\n\nSince 9 is greater than 1, we can conclude that 9.9 is greater than 9.11.\n\nSo, the larger number is \\boxed{9.9}."


if __name__ == "__main__":
    run()
```
Citation
```bibtex
@article{cui2025process,
  title={Process reinforcement through implicit rewards},
  author={Cui, Ganqu and Yuan, Lifan and Wang, Zefan and Wang, Hanbin and Li, Wendi and He, Bingxiang and Fan, Yuchen and Yu, Tianyu and Xu, Qixin and Chen, Weize and others},
  journal={arXiv preprint arXiv:2502.01456},
  year={2025}
}

@article{yuan2024implicitprm,
  title={Free Process Rewards without Process Labels},
  author={Yuan, Lifan and Li, Wendi and Chen, Huayu and Cui, Ganqu and Ding, Ning and Zhang, Kaiyan and Zhou, Bowen and Liu, Zhiyuan and Peng, Hao},
  journal={arXiv preprint arXiv:2412.01981},
  year={2024}
}
```