---
license: mit
library_name: transformers
base_model:
  - deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
base_model_relation: adapter
---

# SeerAttention-R 

This repo contains the decode-stage AttnGate weights from the SeerAttention-R paper. The currently supported models are:
- [SeerAttention/SeerAttention-Decode-Qwen3-4B-AttnGates](https://huggingface.co/SeerAttention/SeerAttention-Decode-Qwen3-4B-AttnGates)
- [SeerAttention/SeerAttention-Decode-Qwen3-8B-AttnGates](https://huggingface.co/SeerAttention/SeerAttention-Decode-Qwen3-8B-AttnGates)
- [SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates](https://huggingface.co/SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates)
- [SeerAttention/SeerAttention-Decode-R1-Distill-Qwen-14B-AttnGates](https://huggingface.co/SeerAttention/SeerAttention-Decode-R1-Distill-Qwen-14B-AttnGates)  ← you are here!


## Results on Reasoning Tasks

Results on reasoning tasks under different token budgets.
All results are averaged pass@1 scores, with 64 samples per query for AIME, 16 samples per query for GPQA, and 8 samples per query for MATH-500.
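For readers unfamiliar with the metric, the following is a minimal sketch of how an averaged pass@1 score can be computed from per-sample correctness flags. It assumes the usual reading of the term: the fraction of correct samples is computed per query, then averaged over all queries. The function name `averaged_pass_at_1` and the toy data are hypothetical and are not taken from the SeerAttention-R evaluation code.

```python
from statistics import mean


def averaged_pass_at_1(correct_flags_per_query):
    """Return averaged pass@1 as a percentage.

    correct_flags_per_query: one list per query, holding a boolean for
    each sampled completion (True if that completion was correct).
    """
    if not correct_flags_per_query:
        raise ValueError("need at least one query")
    per_query = []
    for flags in correct_flags_per_query:
        if not flags:
            raise ValueError("every query needs at least one sample")
        # Fraction of this query's samples that were correct.
        per_query.append(sum(flags) / len(flags))
    # Average the per-query fractions and scale to a percentage.
    return 100.0 * mean(per_query)


# Hypothetical toy data: 2 queries with 4 samples each
# (the tables in this card use 64, 16, or 8 samples per query).
toy = [
    [True, True, False, True],
    [False, False, True, False],
]
print(f"{averaged_pass_at_1(toy):.2f}")
```

Averaging over many samples per query reduces the variance that a single sampled answer would introduce, which matters most for small benchmarks such as AIME.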

### AIME24

| Model                         | 2k    | 4k    | 6k    | 8k    | Full Attention |
|-------------------------------|-------|-------|-------|-------|----------------|
| Qwen3-4B                      | 55.42 | 68.75 | 70.94 | 72.50 | 71.25          |
| Qwen3-8B                      | 56.56 | 72.29 | 74.22 | 75.05 | 74.48          |
| Qwen3-14B                     | 62.24 | 75.78 | 78.02 | 78.65 | 78.91          |
| DeepSeek-R1-Distill-Qwen-14B  | 55.78 | 66.35 | 67.50 | 66.82 | 67.50          |


### AIME25

| Model                         | 2k    | 4k    | 6k    | 8k    | Full Attention |
|-------------------------------|-------|-------|-------|-------|----------------|
| Qwen3-4B                      | 45.73 | 57.60 | 60.20 | 62.90 | 66.41          |
| Qwen3-8B                      | 42.60 | 56.77 | 60.31 | 64.17 | 67.86          |
| Qwen3-14B                     | 46.67 | 62.66 | 67.19 | 69.01 | 70.21          |
| DeepSeek-R1-Distill-Qwen-14B  | 38.44 | 47.19 | 52.25 | 50.05 | 50.00          |


### MATH-500

| Model                         | 1k    | 2k    | 4k    | 6k    | Full Attention |
|-------------------------------|-------|-------|-------|-------|----------------|
| Qwen3-4B                      | 84.80 | 92.20 | 93.60 | 93.60 | 93.93          |
| Qwen3-8B                      | 82.82 | 91.53 | 94.17 | 94.53 | 94.43          |
| Qwen3-14B                     | 85.13 | 93.20 | 94.77 | 94.80 | 95.22          |
| DeepSeek-R1-Distill-Qwen-14B  | 87.65 | 92.10 | 93.05 | 93.12 | 93.30          |


### GPQA Diamond

| Model                         | 1k    | 2k    | 4k    | 6k    | Full Attention |
|-------------------------------|-------|-------|-------|-------|----------------|
| Qwen3-4B                      | 39.61 | 51.20 | 55.20 | 55.90 | 56.19          |
| Qwen3-8B                      | 37.59 | 54.32 | 59.60 | 60.48 | 60.54          |
| Qwen3-14B                     | 44.54 | 59.72 | 63.76 | 64.20 | 65.25          |
| DeepSeek-R1-Distill-Qwen-14B  | 51.26 | 56.79 | 56.41 | 57.48 | 57.80          |