arxiv:2505.16022

NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning

Published on May 21
· Submitted by thinkwee on May 26

Abstract

AI-generated summary

NOVER, a reinforcement learning framework that eliminates the need for external verifiers, enhances language model performance across text-to-text tasks.

Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely based on the final answer part of a language model's output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning framework that requires only standard supervised fine-tuning data with no need for an external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms the model of the same size distilled from large reasoning models such as DeepSeek R1 671B by 7.7 percent. Moreover, the flexibility of NOVER enables new possibilities for optimizing large language models, such as inverse incentive training.
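
To make the contrast concrete, here is a minimal sketch of the kind of rule-based, verifier-style reward that R1-Zero-like incentive training relies on and that NOVER removes: only the final answer span is scored, and the check is a hard-coded comparison. The <think>/<answer> tag format and the exact-match test are illustrative assumptions, not the recipe of any particular system.

```python
import re

def answer_match_reward(completion: str, reference: str) -> float:
    """Verifier-style reward: score only the final answer span and ignore the
    intermediate reasoning entirely. Malformed outputs with no <answer> block
    receive zero reward."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0
```

In practice the comparison is a domain-specific verifier (a math expression checker, a unit-test harness, and so on), which is exactly the dependency that restricts this paradigm to verifiable domains such as mathematics and coding.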

Community

Paper author Paper submitter · edited 11 days ago

NOVER

  • NOVER (NO-VERifier) is a new post-training method that extends RLVR from math and coding to any domain. It can perform DeepSeek R1-Zero-style incentive training on any SFT data, with no verifier and no reward model needed.

  • It uses the policy model itself to derive a proxy model for reasoning perplexity-based reward modeling, achieving stable training and strong performance across a range of tasks (a minimal sketch follows this list).

  • See the paper for further discussion, including the "curse of proxy", how reasoning patterns evolve during training, and inverse incentive training.
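
As a rough illustration of the second bullet, here is a minimal sketch of what a reasoning-perplexity reward could look like, assuming the proxy is simply a frozen snapshot of the policy and that completions follow a <think>/<answer> template. The model name, prompt format, and the 1/perplexity mapping are illustrative assumptions, not the paper's exact formulation; the idea is that a reasoning trace earns more reward when it makes the ground-truth answer easier (lower perplexity) for the proxy to predict.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder policy checkpoint; the method itself is model-agnostic.
MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
proxy = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()  # assumed: frozen snapshot of the policy

@torch.no_grad()
def reasoning_perplexity_reward(question: str, reasoning: str, reference: str) -> float:
    """Reward a sampled reasoning trace by how predictable it makes the
    ground-truth answer: lower perplexity of `reference` conditioned on
    (question + reasoning) -> higher reward."""
    context = f"{question}\n<think>{reasoning}</think>\n<answer>"
    target = f"{reference}</answer>"
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(target, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # mask context so only answer tokens are scored
    nll = proxy(input_ids=input_ids, labels=labels).loss  # mean NLL over the answer span
    perplexity = torch.exp(nll).item()
    return 1.0 / perplexity  # any monotone decreasing map of perplexity would do here
```

A reward of this shape needs only the question and the reference answer from standard SFT data, which is why no external verifier or separately trained reward model is required.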



I noticed a small typo in one of the figures: "RLHF" is mistakenly written as "RLFH"
[Screenshot attachment: 截屏2025-05-26 18.59.44.png]

Paper author Paper submitter

Thank you! I forgot to correct it. The arXiv version will be updated.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API


Paper author

Thank you, librarian-bot! You're a great bot! Among the papers you listed, three are closely related to NOVER:

  1. Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains – This paper also observes that most current reasoning models concentrate on math and coding tasks, and argues for greater attention to general reasoning. Unlike NOVER, however, it still focuses on domains with verifiable rewards and uses multi-stage training to build a reward model.

  2. General-Reasoner: Advancing LLM Reasoning Across All Domains – Similar to Crossing the Reward Bridge, this work builds a general verifier model for evaluating reasoning tasks across diverse domains.

Both papers emphasize general reasoning and adopt a model-as-verifier approach. However, in our NOVER experiments, we found that using a model as a verifier makes the system vulnerable to reward-hacking and hard to generalize.

  3. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? – This paper questions whether reinforcement learning actually pushes models beyond the capabilities of their pre-trained base, or simply improves sample efficiency. In NOVER, we observed a similar "pre-trained model ceiling" in general reasoning tasks: reward-incentivized models still struggle with false-premise questions, which are also difficult for the base model. Overcoming this ceiling is a key challenge for inference-time scaling via incentive-driven training.

