SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Abstract
SPIRAL trains language models via self-play on zero-sum games, yielding reasoning improvements that transfer to math and general-reasoning benchmarks.
Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems, because models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. With SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly: training Qwen3-4B-Base on Kuhn Poker alone achieves an 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance, as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) still yields a 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
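The abstract names role-conditioned advantage estimation (RAE) without spelling it out. The sketch below shows one plausible minimal form, assuming a per-(game, role) exponential-moving-average baseline; the class name, decay value, and update rule are our illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict

class RoleConditionedAdvantage:
    """Minimal sketch of role-conditioned advantage estimation (RAE).

    In zero-sum self-play, the two roles can have systematically different
    expected returns (e.g., a first-mover advantage), so this keeps a
    separate baseline per (game, role) pair and measures each reward
    against what that role typically earns. The EMA update and decay are
    assumptions for illustration, not the paper's exact recipe.
    """

    def __init__(self, decay: float = 0.95):
        self.decay = decay
        self.baseline = defaultdict(float)  # (game, role) -> running baseline

    def advantage(self, game: str, role: int, reward: float) -> float:
        key = (game, role)
        adv = reward - self.baseline[key]  # reward relative to this role's norm
        # Exponential-moving-average update of the role-specific baseline.
        self.baseline[key] = self.decay * self.baseline[key] + (1 - self.decay) * reward
        return adv
```

In such a setup, a trainer would weight each trajectory's policy-gradient loss by this advantage rather than by the raw game outcome.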
Community
SPIRAL demonstrates that language models can learn advanced reasoning by playing simple zero-sum language games against themselves, without needing math problems or human data.
Key results:
- Training on Kuhn Poker alone improves math reasoning by 8.6% and general reasoning by 8.4%
- Outperforms supervised fine-tuning on 25,000 expert game trajectories
- Works on both base models (Qwen3-4B) and strong reasoning models (DeepSeek-R1-Distill-Qwen-7B)
How it works:
- Models play zero-sum games (TicTacToe, Kuhn Poker, Simple Negotiation) against continuously improving versions of themselves
- Game outcomes provide automatic rewards without human supervision
- Self-play creates an infinite curriculum that adapts to the model's current skill level (see the toy loop sketched after this list)
- Role-conditioned advantage estimation (RAE) prevents reasoning collapse in competitive settings
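As a rough illustration of this loop, here is a self-contained toy sketch: matching pennies stands in for the paper's games, and tabular REINFORCE with per-role EMA baselines stands in for LLM policy-gradient updates with RAE. Every name and constant below is an illustrative assumption, not the SPIRAL codebase.

```python
import math
import random
from collections import defaultdict

ACTIONS = ("heads", "tails")

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return [e / sum(exps) for e in exps]

class TabularPolicy:
    """Toy stand-in for the LLM: one softmax distribution per role."""

    def __init__(self):
        self.logits = defaultdict(float)  # (role, action) -> logit

    def probs(self, role):
        return softmax([self.logits[(role, a)] for a in ACTIONS])

    def act(self, role):
        return random.choices(ACTIONS, weights=self.probs(role))[0]

    def reinforce(self, role, action, advantage, lr=0.1):
        # REINFORCE step: raise the chosen action's logit in proportion to
        # its advantage, lower the alternatives (softmax score function).
        probs = dict(zip(ACTIONS, self.probs(role)))
        for a in ACTIONS:
            grad = (1.0 if a == action else 0.0) - probs[a]
            self.logits[(role, a)] += lr * advantage * grad

policy = TabularPolicy()
baseline = {0: 0.0, 1: 0.0}  # per-role EMA baselines, in the spirit of RAE

for episode in range(5000):
    # Both roles are played by the same, continuously improving policy,
    # so the opponent strengthens in lockstep with the learner.
    a0, a1 = policy.act(0), policy.act(1)
    r0 = 1.0 if a0 == a1 else -1.0  # zero-sum: the game itself is the verifier
    for role, action, reward in ((0, a0, r0), (1, a1, -r0)):
        baseline[role] = 0.95 * baseline[role] + 0.05 * reward
        policy.reinforce(role, action, reward - baseline[role])

print("role 0 probs:", policy.probs(0))  # should hover near the uniform equilibrium
```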
Why it matters:
- A fully online, multi-turn, multi-agent RL system for LLMs
- Games teach transferable reasoning patterns: systematic decomposition, expected value calculation, and case-by-case analysis (a worked example follows this list)
- Different games develop complementary skills that combine synergistically
- Opens a path to autonomous reasoning development without expensive human-curated datasets
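To ground "expected value calculation" in a concrete case, here is a small worked example using Kuhn Poker's standard rules (three cards J/Q/K, one-chip antes, one-chip bets). The scenario and numbers are textbook game theory offered as illustration, not results from the paper.

```python
# Worked expected-value calculation in Kuhn Poker: you hold the Queen and
# face a one-chip bet. Assume the opponent always bets the King and bluffs
# the Jack with probability b; antes are one chip each.

def ev_call(bluff_prob: float) -> float:
    """Net chips from calling with the Queen, given the opponent's bluff rate."""
    p_jack = bluff_prob / (1 + bluff_prob)  # P(opponent holds J | they bet)
    p_king = 1 / (1 + bluff_prob)           # P(opponent holds K | they bet)
    # Win: collect opponent's ante + bet (+2). Lose: forfeit ante + call (-2).
    return p_jack * 2 + p_king * (-2)

# Folding always loses the one-chip ante (-1), so calling becomes profitable
# once the opponent bluffs more than a third of the time.
for b in (0.0, 1 / 3, 0.5):
    print(f"bluff rate {b:.2f}: EV(call) = {ev_call(b):+.2f}")
```

At b = 1/3 the call's EV matches the fold's -1 chip, which is exactly the kind of case-by-case, expected-value reasoning the paper reports transferring to math problems.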
Code & Models: https://github.com/spiral-rl/spiral