Abstract
Rectified Sparse Attention (ReSA) improves the efficiency of long-sequence generation in Large Language Models by combining block-sparse attention with periodic dense rectification, maintaining high-quality generation.
Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42× end-to-end speedup when decoding at 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at https://aka.ms/ReSA-LM.
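The interplay between sparse decoding and fixed-interval rectification can be illustrated with a small scheduling sketch. This is not the authors' implementation: the names `resa_schedule`, `max_unrectified`, and `rectify_every` are hypothetical, and the sketch assumes that each dense pass re-encodes exactly the tokens generated since the previous refresh, a detail the abstract does not spell out. It only models when each kind of step happens, not the attention computation itself.

```python
from typing import List, Tuple

Event = Tuple[str, int, int]  # (kind, start, end) with end exclusive


def resa_schedule(num_new_tokens: int, rectify_every: int) -> List[Event]:
    """Return the order of decoding events for a toy ReSA-style loop.

    'sparse'  : one token decoded with block-sparse attention at position `start`.
    'rectify' : one dense forward pass over tokens [start, end), assumed here
                to overwrite their approximate KV cache entries.
    """
    if rectify_every <= 0:
        raise ValueError("rectify_every must be positive")
    events: List[Event] = []
    pending_start = 0  # first generated token whose KV entries are still approximate
    for t in range(num_new_tokens):
        events.append(("sparse", t, t + 1))
        # Assumption: rectify once `rectify_every` tokens have been sparsely decoded.
        if t + 1 - pending_start == rectify_every:
            events.append(("rectify", pending_start, t + 1))
            pending_start = t + 1
    return events


def max_unrectified(events: List[Event]) -> int:
    """Largest number of consecutive sparse steps between dense refreshes."""
    worst = current = 0
    for kind, _start, _end in events:
        if kind == "sparse":
            current += 1
            worst = max(worst, current)
        else:
            current = 0
    return worst


if __name__ == "__main__":
    for event in resa_schedule(num_new_tokens=10, rectify_every=4):
        print(event)
```

In this toy model the number of tokens decoded on an unrefreshed cache never exceeds `rectify_every`, which is the sense in which a fixed refresh interval bounds how far approximation error can accumulate.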