arxiv:2506.24119

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Published on Jun 30
· Submitted by Benjamin-eecs on Jul 1
#2 Paper of the day
Authors:
Bo Liu, et al.
Abstract

AI-generated summary: Self-play in zero-sum games using SPIRAL enhances reasoning capabilities in language models through self-improvement and transfer learning.

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to a 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.

Community

Paper author and submitter:

SPIRAL demonstrates that language models can learn advanced reasoning by playing simple zero-sum language games against themselves, without needing math problems or human data.

Key results:

  • Training on Kuhn Poker alone improves math reasoning by 8.6% and general reasoning by 8.4%
  • Outperforms supervised fine-tuning on 25,000 expert game trajectories
  • Works on both base models (Qwen3-4B) and strong reasoning models (DeepSeek-R1-Distill-Qwen-7B)

How it works:

  1. Models play zero-sum games (TicTacToe, Kuhn Poker, Simple Negotiation) against continuously improving versions of themselves
  2. Game outcomes provide automatic rewards without human supervision
  3. Self-play creates an infinite curriculum that adapts to the model's current skill level
  4. Role-conditioned Advantage Estimation (RAE) prevents reasoning collapse in competitive settings (a minimal sketch follows this list)
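
A minimal sketch of how RAE could look, assuming it keeps a separate running baseline for each (game, role) pair and subtracts it from the terminal zero-sum reward; the class name, method names, and EMA smoothing factor below are illustrative, not the SPIRAL implementation.

```python
# Illustrative sketch of role-conditioned advantage estimation (RAE), assuming
# a separate running baseline per (game, role) pair subtracted from the
# terminal zero-sum reward. Names and the smoothing factor are placeholders.
from collections import defaultdict


class RoleConditionedBaseline:
    def __init__(self, alpha: float = 0.95):
        self.alpha = alpha                    # EMA smoothing factor (assumed value)
        self.baseline = defaultdict(float)    # (game, role) -> running mean reward

    def advantage(self, game: str, role: int, reward: float) -> float:
        """Advantage of `reward` against the current per-role baseline, then update it."""
        key = (game, role)
        adv = reward - self.baseline[key]
        self.baseline[key] = self.alpha * self.baseline[key] + (1 - self.alpha) * reward
        return adv


# Both players are the same policy, so a win for role 0 is a loss for role 1;
# separate per-role baselines keep each role's advantage signal centered
# instead of letting the asymmetry between roles bias the policy gradient.
rae = RoleConditionedBaseline()
print(rae.advantage("kuhn_poker", 0, +1.0))   # role 0 won this hand
print(rae.advantage("kuhn_poker", 1, -1.0))   # role 1 (same model) lost it
```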

Why it matters:

  • A fully online, multi-turn, multi-agent RL system for LLMs (sketched after this list)
  • Games teach transferable reasoning patterns: systematic decomposition, expected value calculation, and case-by-case analysis
  • Different games develop complementary skills that combine synergistically
  • Opens a path to autonomous reasoning development without expensive human-curated datasets
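
To make the "fully online, multi-turn, multi-agent" claim concrete, here is a hedged sketch of the outer self-play loop this implies: one policy plays both roles of a randomly chosen zero-sum text game, the game outcome is the only reward, and the RAE baseline from the sketch above turns it into per-role advantages. `env_factory`, `policy.sample_action`, and `policy.policy_gradient_step` are placeholders, not the SPIRAL API.

```python
# Hedged sketch of the self-play training loop: the same policy acts in both
# roles, rewards come only from the game outcome, and RAE advantages feed a
# standard policy-gradient update. All environment/policy hooks are assumed.
import random

GAMES = ["tictactoe", "kuhn_poker", "simple_negotiation"]


def self_play_episode(policy, env):
    """Play one multi-turn game with the same policy acting in both roles."""
    env.reset()
    turns, done, rewards = [], False, None
    while not done:
        role = env.current_role()               # 0 or 1
        prompt = env.render_prompt(role)        # game state rendered as text for the LLM
        action = policy.sample_action(prompt)   # model generates its next move
        rewards, done = env.step(action)
        turns.append((role, prompt, action))
    return turns, rewards                       # rewards: {0: r0, 1: r1} with r0 == -r1


def train_step(policy, rae, env_factory, batch_size=32):
    """One fully online update: fresh self-play games, RAE advantages, policy gradient."""
    batch = []
    for _ in range(batch_size):
        game = random.choice(GAMES)             # multi-game training
        turns, rewards = self_play_episode(policy, env_factory(game))
        adv = {r: rae.advantage(game, r, rewards[r]) for r in (0, 1)}
        for role, prompt, action in turns:
            batch.append((prompt, action, adv[role]))
    policy.policy_gradient_step(batch)
```

The design point the post emphasizes is that no reward engineering or curated data enters this loop: because the opponent is the model itself, the curriculum hardens automatically as the policy improves.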

Code & Models: https://github.com/spiral-rl/spiral
