RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning
Abstract
Tango is an RL framework that concurrently trains an LLM generator and a generative, RL-trained process-level verifier in an interleaved manner, achieving superior robustness and generalization on competition-level math benchmarks and out-of-domain reasoning tasks.
Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
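To make the interleaved co-training described in the abstract more concrete, below is a minimal, hypothetical sketch of such a loop. It is not the authors' implementation: the generator, verifier, and their RL updates are stubbed out with placeholder functions (`generator_step`, `verifier_step`, the 0.5 reward weighting, and the 0.5 acceptance threshold are all illustrative assumptions), and a real system would run policy-gradient updates (e.g., PPO/GRPO-style) over actual LLM policies.

```python
# Hypothetical sketch of a Tango-style interleaved RL co-training loop (not the official code).
# The generator produces a step-by-step solution; the generative verifier scores each step;
# the verifier itself is rewarded only for outcome-level verification correctness.

import random
from dataclasses import dataclass

@dataclass
class Rollout:
    problem: str
    solution_steps: list[str]   # generator's reasoning steps
    final_answer: str

def generator_sample(problem: str) -> Rollout:
    """Stub: sample a reasoning trace from the generator policy."""
    steps = [f"step {i} for {problem}" for i in range(3)]
    return Rollout(problem, steps, final_answer=str(random.randint(0, 9)))

def verifier_judge(rollout: Rollout) -> list[float]:
    """Stub: generative process-level verifier scores each step in [0, 1]."""
    return [random.random() for _ in rollout.solution_steps]

def outcome_correct(rollout: Rollout, gold_answer: str) -> bool:
    """Rule-based outcome check (e.g., exact match on the final answer)."""
    return rollout.final_answer == gold_answer

def generator_step(rollouts: list[Rollout], rewards: list[float]) -> None:
    """Stub: one RL update of the generator on its per-rollout rewards."""

def verifier_step(rollouts: list[Rollout], rewards: list[float]) -> None:
    """Stub: one RL update of the verifier on its per-rollout rewards."""

def train(dataset: list[tuple[str, str]], num_rounds: int = 10) -> None:
    for _ in range(num_rounds):
        batch = random.sample(dataset, k=min(4, len(dataset)))
        rollouts = [generator_sample(problem) for problem, _ in batch]

        # Generator phase: verifier's process scores shape the generator's reward.
        gen_rewards = []
        for (_, gold), ro in zip(batch, rollouts):
            process_score = sum(verifier_judge(ro)) / len(ro.solution_steps)
            outcome_score = 1.0 if outcome_correct(ro, gold) else 0.0
            gen_rewards.append(outcome_score + 0.5 * process_score)  # weighting is an assumption
        generator_step(rollouts, gen_rewards)

        # Verifier phase: rewarded solely for agreeing with the outcome-level verdict,
        # without any process-level annotations.
        ver_rewards = []
        for (_, gold), ro in zip(batch, rollouts):
            predicted_ok = sum(verifier_judge(ro)) / len(ro.solution_steps) > 0.5
            ver_rewards.append(1.0 if predicted_ok == outcome_correct(ro, gold) else 0.0)
        verifier_step(rollouts, ver_rewards)

if __name__ == "__main__":
    toy_dataset = [("2 + 2", "4"), ("3 * 3", "9"), ("10 - 7", "3"), ("5 + 6", "11")]
    train(toy_dataset)
```

The key design point the sketch tries to capture is the asymmetry of reward signals: the generator is rewarded using both the final-answer check and the verifier's step-level scores, while the verifier is rewarded only for whether its overall verdict matches the outcome-level ground truth, which is what lets it co-evolve without process-level labels.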
Community
We introduce Tango, a novel and effective co-evolutionary framework for LLM reasoning that uses RL to concurrently train an LLM generator and an LLM verifier in an interleaved manner.
The Librarian Bot found the following similar papers, recommended by the Semantic Scholar API:
- TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning (2025)
- GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning (2025)
- Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLMs (2025)
- General-Reasoner: Advancing LLM Reasoning Across All Domains (2025)
- Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling (2025)
- RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning (2025)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains (2025)