---
license: mit
library_name: transformers
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
base_model_relation: adapter
---

## SeerAttention-DeepSeek-R1-Distill-Qwen-32B-AttnGates

This repo contains only the AttnGate weights for deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.

[SeerAttention](https://arxiv.org/abs/2410.13276) introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the 2D max-pooled attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying a threshold/TopK to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel. Illustrative sketches of the training target and the inference-time masking appear at the end of this card.

Original GitHub repo: https://github.com/microsoft/SeerAttention

## LongBenchV2 CoT Benchmark

All SeerAttention models run with threshold=5e-4. For the R1-distilled models, we remove the two-pass generation setup (think + summary) and instead ask the models to output the answer directly after thinking. The maximum generation length is set to 10240.

| Model | Overall | Easy | Hard | Short | Medium | Long |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | 30.4 | 31.2 | 29.9 | 37.8 | 24.7 | 29.6 |
| [SeerAttention-Llama-3.1-8B](https://huggingface.co/SeerAttention/SeerAttention-Llama-3.1-8B-AttnGates) | 31.6 | 33.3 | 30.5 | 33.9 | 31.6 | 27.8 |
| [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) | 34.8 | 37.5 | 33.1 | 44.4 | 32.1 | 24.1 |
| [SeerAttention-Qwen2.5-14B](https://huggingface.co/SeerAttention/SeerAttention-Qwen2.5-14B-AttnGates) | 32.8 | 38.0 | 29.6 | 45.0 | 30.2 | 17.6 |
| [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | 36.4 | 42.2 | 32.8 | 47.8 | 29.8 | 30.6 |
| [SeerAttention-Qwen2.5-32B](https://huggingface.co/SeerAttention/SeerAttention-Qwen2.5-32B-AttnGates) | 36.4 | 41.1 | 33.4 | 49.4 | 29.8 | 27.8 |
| [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) | 34.2 | 43.2 | 28.6 | 45.0 | 27.9 | 28.7 |
| [SeerAttention-DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/SeerAttention/SeerAttention-DeepSeek-R1-Distill-Qwen-14B-AttnGates) | 31.6 | 35.9 | 28.9 | 41.7 | 26.0 | 25.9 |
| [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | 37.2 | 42.7 | 33.8 | 47.2 | 35.8 | 23.1 |
| [SeerAttention-DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/SeerAttention/SeerAttention-DeepSeek-R1-Distill-Qwen-32B-AttnGates) | 37.0 | 42.2 | 33.8 | 49.4 | 31.6 | 26.9 |
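
## Sketch: Self-Distillation Target

For intuition, the snippet below illustrates the training target described above: the 2D max-pooled attention pattern of the frozen base model, which the AttnGates learn to mimic. This is a minimal PyTorch sketch, not code from the SeerAttention repo; the function name `block_pooled_target`, the block size of 64, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def block_pooled_target(attn_probs: torch.Tensor, block_size: int = 64) -> torch.Tensor:
    """2D max-pool a full attention map into a block-level target.

    attn_probs: [num_heads, seq_len, seq_len] attention probabilities from
                the frozen base model (seq_len divisible by block_size in
                this toy sketch).
    Returns [num_heads, seq_len // block_size, seq_len // block_size]: the
    soft block-level pattern the AttnGates are trained to reproduce.
    """
    # Treat heads as the channel dim and max-pool over non-overlapping
    # block_size x block_size tiles of the attention map.
    return F.max_pool2d(attn_probs.unsqueeze(1), kernel_size=block_size).squeeze(1)

# Toy example: 2 heads, a 256-token sequence, 64-token blocks -> 4x4 targets.
attn = torch.softmax(torch.randn(2, 256, 256), dim=-1)
target = block_pooled_target(attn, block_size=64)
print(target.shape)  # torch.Size([2, 4, 4])
```

Max-pooling (rather than average-pooling) marks a block as important if *any* of its token-level attention scores is large, which matches the goal of never pruning a block that contains a strong attention edge.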
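
## Sketch: Block-Sparse Mask at Inference

The snippet below illustrates the inference-time step described above: soft gate scores per (query-block, key-block) pair are thresholded into a binary block mask that drives a block-sparse attention kernel. Again a hedged sketch, not the repo's actual kernel or API; the softmax normalization, the shapes, and the keep-top-1 safeguard are assumptions made for illustration.

```python
import torch

def block_sparse_mask(gate_scores: torch.Tensor, threshold: float = 5e-4) -> torch.Tensor:
    """Turn soft AttnGate scores into a binary block mask.

    gate_scores: [num_heads, num_q_blocks, num_k_blocks] soft scores from
                 a (hypothetical) AttnGate for one layer.
    Returns a boolean mask of the same shape; True = compute this block.
    """
    # Normalize over key blocks so the threshold is scale-free
    # (assumption: the gate outputs are treated as logits).
    probs = torch.softmax(gate_scores, dim=-1)
    mask = probs >= threshold
    # Safeguard: always keep at least the strongest key block per query
    # block so no attention row is left empty.
    top1 = probs.argmax(dim=-1, keepdim=True)
    mask.scatter_(-1, top1, True)
    return mask

# Toy example: 2 heads, 4 query blocks, 4 key blocks.
scores = torch.randn(2, 4, 4)
mask = block_sparse_mask(scores, threshold=5e-4)
print(mask.shape, mask.float().mean())  # shape and fraction of blocks kept
```

The threshold=5e-4 used in the benchmark above plays the role of `threshold` here: lowering it keeps more blocks (denser, slower, closer to full attention), while raising it prunes more aggressively. For the real masking and kernel implementation, see the GitHub repo linked above.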