Abstract
Markovian Thinking, implemented in Delethink, enables efficient and scalable reinforcement learning for long-chain-of-thought reasoning in LLMs by decoupling thinking length from context size, resulting in linear compute and constant memory usage.
Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence, this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate that at a 96K average thinking length, LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
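To make the environment concrete, here is a minimal sketch of a Delethink-style rollout in Python. Everything in it is illustrative rather than the paper's exact configuration: `generate` stands in for whatever LLM sampling call you use, and `CHUNK_SIZE`, `CARRYOVER`, and the `</answer>` stop marker are assumed placeholders.

```python
from typing import Callable

CHUNK_SIZE = 8192          # max new tokens per chunk (illustrative)
CARRYOVER = 512            # tail of the previous chunk carried into the next prompt
MAX_CHUNKS = 16            # caps total thinking length (~128K tokens with these numbers)
STOP_MARKER = "</answer>"  # hypothetical end-of-reasoning marker


def delethink_rollout(question: str,
                      generate: Callable[[str, int], str]) -> str:
    """Advance reasoning chunk by chunk while the context stays constant-size.

    `generate(prompt, max_new_tokens)` stands in for any LLM sampling call.
    """
    prompt = question
    trace = []
    for _ in range(MAX_CHUNKS):
        chunk = generate(prompt, CHUNK_SIZE)
        trace.append(chunk)
        if STOP_MARKER in chunk:          # model signals it has finished reasoning
            break
        # Delete the long history: re-prompt with the original question plus a
        # short carryover. Through RL, the policy learns to write a textual
        # state near the chunk boundary that makes this carryover sufficient.
        carryover = chunk[-CARRYOVER:]    # character-level stand-in for a token slice
        prompt = question + "\n" + carryover
    return "".join(trace)
```

The point is that the prompt the model ever sees is bounded by the question plus a short carryover plus one chunk, so per-chunk attention cost is constant and total compute grows linearly with the number of chunks.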
Community
Takeaways
- RLVR is governed by a trivial MDP that everyone takes for granted; in fact, we can design any MDP we want.
- By bounding the state size, we propose the Markovian Thinking paradigm, in which the model learns to advance its reasoning while conditioning only on a fixed-size state.
- Delethink is simple and effective: with an 8K fixed state it matches or beats LongCoT-RL and can think up to 128K tokens (a back-of-the-envelope cost comparison follows this list).
- GPT-OSS 120B and Qwen3 30B-A3B already show strong signs of Markovian thinking.
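The linear-versus-quadratic gap behind these takeaways can be seen with a rough counting argument. The sketch below is a back-of-the-envelope estimate under simplified assumptions (attention cost taken as the number of query-key token pairs; prompt length and constant factors ignored); it is not the paper's cost model.

```python
def attention_pairs(n: int) -> int:
    """Query-key pairs for causal attention over n tokens: 1 + 2 + ... + n."""
    return n * (n + 1) // 2


def longcot_pairs(total: int) -> int:
    """LongCoT: every new token attends to the entire history (~O(T^2))."""
    return attention_pairs(total)


def delethink_pairs(total: int, chunk: int = 8192) -> int:
    """Delethink: the context is reset every `chunk` tokens (~O(T * chunk))."""
    full, rest = divmod(total, chunk)
    return full * attention_pairs(chunk) + attention_pairs(rest)


T = 96_000  # average thinking length cited in the abstract
print(longcot_pairs(T) / delethink_pairs(T))  # roughly 12x fewer attention pairs
```

The measured end-to-end gap (27 vs. 7 H100-months in the abstract) is smaller than this raw attention ratio, which is expected since generation also pays per-token costs that do not depend on context length.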
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking (2025)
- Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models (2025)
- VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models (2025)
- K2-Think: A Parameter-Efficient Reasoning System (2025)
- Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression (2025)
- Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners (2025)
- rStar2-Agent: Agentic Reasoning Technical Report (2025)