Update README.md
README.md
CHANGED
@@ -25,6 +25,8 @@ Moreover, we provide a [detailed recipe](https://github.com/RLHFlow/Online-DPO-R
 - Iterative DPO: Following the RLHF Workflow framework (https://arxiv.org/pdf/2405.07863), in each iteration, we sample multiple responses from the last trained policy, rank them via the rule-based reward, and construct the preference pairs.
 Then, we optimize the policy by minimizing the DPO loss and enter the next iteration.
 Online iterative DPO can effectively mitigate the distribution shift and the limited coverage issues of offline data.
+ - RLHFlow/Qwen2.5-7B-DPO-Zero is trained with preference learning from the base model Qwen2.5-Math-7B-Base.
+
 More details can be found in our [blog](https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175)!
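The bullet above outlines the training loop; for a concrete picture, here is a minimal sketch of one online iterative DPO round: sample several responses per prompt from the current policy, score them with a rule-based reward, keep best-vs-worst pairs, and minimize the DPO loss. The helper names (`build_preference_pairs`, `dpo_loss`) and the best-vs-worst pairing rule are illustrative assumptions, not the exact Online-DPO-R1 implementation.

```python
# Illustrative sketch of one online iterative DPO round (assumed structure,
# not the repo's actual code).
import torch
import torch.nn.functional as F


def build_preference_pairs(prompts, responses_per_prompt, rewards_per_prompt):
    """Pair the highest-reward response (chosen) with the lowest-reward one
    (rejected) for each prompt; prompts whose rewards all tie are skipped."""
    pairs = []
    for prompt, responses, rewards in zip(prompts, responses_per_prompt, rewards_per_prompt):
        if max(rewards) == min(rewards):
            continue  # no preference signal from the rule-based reward
        chosen = responses[rewards.index(max(rewards))]
        rejected = responses[rewards.index(min(rewards))]
        pairs.append((prompt, chosen, rejected))
    return pairs


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy log-ratio - reference
    log-ratio)), where each log-prob is the summed token log-likelihood of a
    full response under the corresponding model."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```

How the reference model is handled between iterations (kept fixed or reset to the last policy) is a recipe choice; see the blog and detailed recipe linked above for the settings actually used.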