---
license: apache-2.0
datasets:
- Tongyi-Zhiwen/DocQA-RL-1.6K
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
tags:
- long-context
- large-reasoning-model
pipeline_tag: text-generation
library_name: transformers
---
# QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
-----------------------------
[License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
[arXiv: 2505.17667](https://arxiv.org/abs/2505.17667)
[GitHub: Tongyi-Zhiwen/QwenLong-L1](https://github.com/Tongyi-Zhiwen/QwenLong-L1)
[ModelScope: QwenLong-L1-32B](https://modelscope.cn/models/iic/QwenLong-L1-32B)
[🤗 Hugging Face: QwenLong-L1-32B](https://huggingface.co/Tongyi-Zhiwen/QwenLong-L1-32B)
_**Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li,**_
_**Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan**_
_Tongyi Lab, Alibaba Group_
## 🔥 News
- **May 28, 2025:** 🔥 We release [🤗 QwenLong-L1-32B-AWQ](https://huggingface.co/Tongyi-Zhiwen/QwenLong-L1-32B-AWQ), quantized to int4 with AWQ using the ms-swift framework.
- **May 26, 2025:** 🔥 We release [🤗 QwenLong-L1-32B](https://huggingface.co/Tongyi-Zhiwen/QwenLong-L1-32B), the first long-context LRM trained with reinforcement learning for long-context reasoning. Experiments on seven long-context DocQA benchmarks show that **QwenLong-L1-32B outperforms flagship LRMs such as OpenAI-o3-mini and Qwen3-235B-A22B and achieves performance on par with Claude-3.7-Sonnet-Thinking**, placing it among the leading state-of-the-art LRMs.
- **May 26, 2025:** 🔥 We release [🤗 DocQA-RL-1.6K](https://huggingface.co/datasets/Tongyi-Zhiwen/DocQA-RL-1.6K), a specialized RL training dataset comprising 1.6K document question answering (DocQA) problems spanning mathematical, logical, and multi-hop reasoning domains.
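Since the released checkpoint is a standard `transformers` model on the Hugging Face Hub, a minimal inference sketch might look like the following. Note the `<text>` prompt wrapper, the `build_messages` helper, and the generation settings are illustrative assumptions for this card, not the official prompt template; consult the GitHub repository for the exact format used in training.

```python
def build_messages(context: str, question: str) -> list[dict]:
    """Wrap a long document and a question into a chat-style user turn.

    The <text> wrapper and phrasing are illustrative assumptions, not
    the official QwenLong-L1 prompt template.
    """
    user_content = f"<text>\n{context}\n</text>\n\nQuestion: {question}"
    return [{"role": "user", "content": user_content}]


def generate_answer(context: str, question: str, max_new_tokens: int = 2048) -> str:
    """Generate an answer with QwenLong-L1-32B (requires substantial GPU memory)."""
    # Imported lazily so the prompt helper above stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Tongyi-Zhiwen/QwenLong-L1-32B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    prompt = tokenizer.apply_chat_template(
        build_messages(context, question),
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(
        output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
```

For the AWQ int4 checkpoint, the same sketch should apply with `model_id` swapped to `Tongyi-Zhiwen/QwenLong-L1-32B-AWQ`, provided an AWQ-compatible runtime is installed.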
## 📚 Introduction
In this work, we propose QwenLong-L1, a novel reinforcement learning (RL) framework designed to transition large reasoning models (LRMs) from short-context proficiency to robust long-context generalization. In preliminary experiments, we illustrate how the training dynamics of short-context and long-context reasoning RL differ.