AI & ML interests
Workflow of Reinforcement Learning from Human Feedback (RLHF). Blog: https://rlhflow.github.io/
Collections

- We collect open-source datasets and process them into a standard format.
- We train reward models via maximum likelihood estimation under the Bradley-Terry model (a loss sketch follows this list).
- Datasets, code, and models for online RLHF (i.e., iterative DPO); a loss sketch closes this section.
- A series of SFT models trained on RLHFlow's high-quality SFT dataset for research purposes.
- The online-DPO-R1 project.
- Datasets and models for process reward modeling.
- The mixture of preference datasets used for reward modeling.
- Materials for training pairwise preference models.
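A minimal sketch of the Bradley-Terry objective referenced above: the reward model assigns a scalar score to each response, and training maximizes log sigmoid(r_chosen - r_rejected) over preference pairs. The function and tensor names below are ours for illustration, not RLHFlow's.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry preference model.

    Both inputs have shape (batch,): scalar scores r(x, y) from the
    reward model for the preferred and rejected response of each pair.
    """
    # P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected),
    # so maximum likelihood estimation minimizes the negative
    # log-sigmoid of the score difference.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Minimizing this loss over a preference dataset is exactly the maximum likelihood estimation mentioned in the collection description.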
Reward models trained with the RLHFlow codebase (https://github.com/RLHFlow/RLHF-Reward-Modeling/); a usage sketch follows the list:

- RLHFlow/ArmoRM-Llama3-8B-v0.1 (Text Classification, 8B, 19.6k downloads, 179 likes)
- RLHFlow/pair-preference-model-LLaMA3-8B (Text Generation, 8B, 2.09k downloads, 38 likes)
- sfairXC/FsfairX-LLaMA3-RM-v0.1 (Text Classification, 8B, 2.13k downloads, 59 likes)
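A minimal sketch of scoring a conversation with one of the reward models above, assuming the checkpoint loads as a standard transformers sequence-classification head. Check each model card for the exact recipe; ArmoRM, for instance, requires trust_remote_code=True and returns multi-objective scores rather than a single logit.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "sfairXC/FsfairX-LLaMA3-RM-v0.1"  # or another RM listed above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [
    {"role": "user", "content": "What is RLHF?"},
    {"role": "assistant",
     "content": "RLHF fine-tunes a language model on human preference data."},
]
# Render the conversation with the model's chat template, then score it;
# the scalar logit serves as the reward for the assistant response.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
with torch.no_grad():
    reward = model(input_ids).logits[0][0].item()
print(f"reward: {reward:.3f}")
```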
Paper: RLHF Workflow: From Reward Modeling to Online RLHF (arXiv:2405.07863)
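For the online (iterative) DPO recipe from the paper above, each round generates fresh responses, labels pairs with the reward model, and runs DPO updates on them. A minimal sketch of the per-pair DPO loss, with all names ours; each input is a (batch,) tensor of response log-probabilities summed over tokens, under the current policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for a batch of preference pairs."""
    # Implicit reward margin: how much more the policy (relative to the
    # reference model) prefers the chosen response over the rejected one.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    return -F.logsigmoid(beta * margin).mean()
```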