Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference
Abstract
Adamas, a sparse attention mechanism, achieves high accuracy and speed in long-context inference by combining the Hadamard transform, bucketization, 2-bit compression, and Manhattan-distance estimation.
Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering, and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall the critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization, and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selection. Experiments show that Adamas matches the accuracy of full attention with a budget of only 64 tokens, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods, while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.
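The abstract only outlines the selection pipeline, so the sketch below shows one plausible reading of it: queries and keys are rotated with a Hadamard transform, each coordinate is bucketized into a 2-bit code, and the per-token budget is filled with the keys whose codes are closest to the query's code in Manhattan distance. The transform normalization, bucket thresholds, and helper names (`hadamard`, `bucketize_2bit`, `select_topk`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Build an n x n normalized Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def bucketize_2bit(x: np.ndarray, thresholds=(-0.5, 0.0, 0.5)) -> np.ndarray:
    """Map each coordinate to one of 4 buckets (a 2-bit code per dimension).
    The thresholds are placeholders; the paper's bucket boundaries may differ."""
    return np.digitize(x, thresholds).astype(np.int8)  # values in {0, 1, 2, 3}

def select_topk(query: np.ndarray, keys: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` keys whose 2-bit codes are nearest to the
    query's code under Manhattan (L1) distance, a cheap proxy for attention score."""
    d = query.shape[-1]
    H = hadamard(d)
    q_code = bucketize_2bit(query @ H)              # (d,)
    k_codes = bucketize_2bit(keys @ H)              # (n, d); in practice cached per key
    dists = np.abs(k_codes - q_code).sum(axis=-1)   # Manhattan distance between codes
    return np.argsort(dists)[:budget]

# Toy usage: 1024 cached keys, head dimension 128, 64-token budget.
rng = np.random.default_rng(0)
q = rng.standard_normal(128)
K = rng.standard_normal((1024, 128))
idx = select_topk(q, K, budget=64)
# Exact attention would then be computed only over K[idx] and the matching values.
```

In a real decoder, the key codes would presumably be computed once when each token enters the KV cache, so every decoding step only pays for the query transform plus a scan over compact 2-bit codes rather than full-precision dot products.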
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- ProxyAttn: Guided Sparse Attention via Representative Heads (2025)
- DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning (2025)
- InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation (2025)
- VideoNSA: Native Sparse Attention Scales Video Understanding (2025)
- KVCompose: Efficient Structured KV Cache Compression with Composite Tokens (2025)
- SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers (2025)
- Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction (2025)