RLHFlow's Collections
Online-DPO-R1 (updated Feb 28)
This is the collection of the Online-DPO-R1 project.
Models:
RLHFlow/Qwen2.5-7B-PPO-Zero • 8B • Updated Feb 17 • 5 • 2
RLHFlow/Qwen2.5-7B-DPO-Zero • 8B • Updated Feb 17 • 5
RLHFlow/Qwen2.5-7B-DPO-NLL-Zero • 8B • Updated Feb 17 • 3
RLHFlow/Qwen2.5-7B-RAFT-Zero • 8B • Updated Feb 17 • 3
RLHFlow/Qwen2.5-7B-DPO • 8B • Updated Feb 17 • 4
RLHFlow/Qwen2.5-7B-SFT • 8B • Updated Feb 17 • 6

Datasets:
RLHFlow/numia_prompt_ppo • Viewer • Updated Feb 13 • 404k rows • 9 • 1
RLHFlow/numia_prompt_dpo1 • Viewer • Updated Feb 11 • 20k rows • 730
RLHFlow/qwq_gen_sft_15k • Viewer • Updated Feb 17 • 15k rows • 21
RLHFlow/numia_prompt_dpo2 • Viewer • Updated Feb 11 • 20k rows • 16
RLHFlow/numia_prompt_dpo3 • Viewer • Updated Feb 11 • 20k rows • 18
Paper:
Self-rewarding correction for mathematical reasoning • 2502.19613 • Published Feb 26 • 84