Paper: Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL (2505.02391)
Workflow of Reinforcement Learning from Human Feedback (RLHF). Blog: https://rlhflow.github.io/