arxiv:2510.09883

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

Published on Oct 10

Authors:

Abstract

DELTA, a training-free sparse attention mechanism, enhances computational efficiency in large reasoning models without compromising accuracy by selectively attending to salient tokens.

AI-generated summary

Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to the entire growing sequence. Existing sparse attention methods reduce computation by pruning the key-value (KV) cache, yet they suffer from severe accuracy degradation on reasoning tasks due to cumulative selection errors and the dynamic importance of tokens over long derivations. We present DELTA, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy. DELTA partitions transformer layers into three groups: initial layers that use full attention, a small set of selection layers that identify salient tokens via aggregated head-level attention scores, and subsequent sparse-attention layers that attend only to the selected subset. This design preserves the full KV cache in GPU memory for accuracy, while avoiding expensive full-attention computation over many layers. On reasoning benchmarks such as AIME and GPQA-Diamond, DELTA matches or surpasses full attention in accuracy, while reducing the number of attended tokens by up to 5times and delivering 1.5times end-to-end speedup. Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.09883 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.09883 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.09883 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.