Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction
Abstract
Sparse Query Attention (SQA) reduces computational complexity in Transformer models by decreasing the number of Query heads, leading to significant throughput improvements with minimal impact on model quality.
The Transformer architecture, underpinned by the Multi-Head Attention (MHA) mechanism, has become the de facto standard for state-of-the-art models in artificial intelligence. However, the quadratic computational complexity of MHA with respect to sequence length presents a significant barrier to scaling, particularly for applications involving long contexts. Prevailing solutions, such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), have effectively addressed the memory bandwidth bottleneck that dominates autoregressive inference latency by sharing Key and Value projections. While highly successful, these methods do not reduce the fundamental number of floating-point operations (FLOPs) required for the attention score computation, which remains a critical bottleneck for training and full-sequence processing. This paper introduces Sparse Query Attention (SQA), a novel attention architecture that pursues an alternative and complementary optimization path. Instead of reducing Key/Value heads, SQA reduces the number of Query heads. This architectural modification directly decreases the computational complexity of the attention mechanism by a factor proportional to the reduction in query heads, thereby lowering the overall FLOPs. This work presents the theoretical foundation of SQA, its mathematical formulation, and a family of architectural variants. Empirical benchmarks on long sequences (32k-200k tokens) demonstrate that SQA can achieve significant throughput improvements of up to 3x in computation-bound scenarios such as model pre-training, fine-tuning, and encoder-based tasks, with only a minimal impact on model quality in preliminary small-scale experiments. SQA was discovered serendipitously during the development of the upcoming Reactive Transformer architecture, suggesting its potential as a powerful tool for building more efficient and scalable models.
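To make the core idea concrete, the sketch below shows one way a Sparse Query Attention layer could be written in PyTorch: the input is projected to fewer Query heads than a standard MHA layer would use (with GQA-style sharing of Key/Value heads on top), so the attention-score computation scales with the reduced query-head count. This is an illustrative reconstruction based only on the abstract, not the authors' reference implementation; the class name `SparseQueryAttention` and parameters such as `num_query_heads` and `num_kv_heads` are assumptions.

```python
# Illustrative sketch of Sparse Query Attention (SQA), reconstructed from the
# abstract. Not the authors' reference implementation; names and parameter
# choices here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseQueryAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int,
                 num_query_heads: int, num_kv_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        assert num_query_heads % num_kv_heads == 0
        self.head_dim = d_model // num_heads           # per-head width of the baseline MHA layer
        self.num_query_heads = num_query_heads         # H_q < H: the source of the FLOP savings
        self.num_kv_heads = num_kv_heads
        self.groups = num_query_heads // num_kv_heads  # GQA-style sharing of K/V across query heads
        self.q_proj = nn.Linear(d_model, num_query_heads * self.head_dim)
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)
        # Output projection maps the reduced H_q * head_dim back to d_model.
        self.o_proj = nn.Linear(num_query_heads * self.head_dim, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_query_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat K/V so each group of query heads attends to a shared K/V head.
        k = k.repeat_interleave(self.groups, dim=1)
        v = v.repeat_interleave(self.groups, dim=1)
        # Score computation costs O(H_q * t^2 * head_dim): fewer query heads, fewer FLOPs.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, t, self.num_query_heads * self.head_dim)
        return self.o_proj(out)


# Example: a 512-dim layer whose MHA baseline has 16 heads, reduced to 8 query
# heads and 4 shared K/V heads (hypothetical configuration).
layer = SparseQueryAttention(d_model=512, num_heads=16, num_query_heads=8, num_kv_heads=4)
y = layer(torch.randn(2, 1024, 512))  # (batch=2, seq_len=1024, d_model=512)
```

Because the score matrix QKᵀ is computed once per query head, halving the number of query heads roughly halves the FLOPs of the attention stage, which is a different lever from the memory-bandwidth savings that MQA and GQA target.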
Community
This paper introduces Sparse Query Attention (SQA) for more computationally efficient training and prompt (prefill) processing. The experiments were small-scale due to a limited budget, but we will test it at real scale in future work.
The following similar papers were recommended by the Semantic Scholar API (automated message from Librarian Bot):
- EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs (2025)
- SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers (2025)
- AQUA: Attention via QUery mAgnitudes for Memory and Compute Efficient Inference in LLMs (2025)
- TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference (2025)
- Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution (2025)
- Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel (2025)
- Rethinking Transformer Connectivity: TLinFormer, A Path to Exact, Full Context-Aware Linear Attention (2025)
Models citing this paper: 9
Datasets citing this paper: 0
Spaces citing this paper: 0