Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs
Abstract
Ring-lite uses a MoE architecture and reinforcement learning to efficiently match SOTA reasoning models while activating fewer parameters and addressing challenges specific to MoE training.
We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization(C3PO), a novel approach that enhances training stability and improves computational throughput via algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss for RL training, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training with mixed dataset. We will release the model, dataset, and code.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning (2025)
- QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning (2025)
- MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining (2025)
- 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models (2025)
- Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math (2025)
- DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training (2025)
- General-Reasoner: Advancing LLM Reasoning Across All Domains (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 1
Datasets citing this paper 2
Spaces citing this paper 0
No Space linking this paper