# S1.1-QwQ-DS
This model is a fine-tuned version of deepseek-ai/DeepSeek-R1-Distill-Qwen-32B on the S1.1-QwQ dataset.
The model achieves state-of-the-art reasoning performance on challenging benchmarks including AIME2024/2025, MATH500, and GPQA-Diamond.
## Training and evaluation data
We use LLaMAFactory on 8× NVIDIA A100-SXM4-80GB GPUs to run full-parameter fine-tuning on our self-curated S1.1-QwQ dataset, a refined version of the S1.1-1K dataset.
We use QwQ-32B to generate a reasoning trajectory for each problem in the S1.1-1K dataset (a sketch of this step is given below). In our experiments, the QwQ-generated trajectories are of higher quality than the original ones produced by Gemini-2.0-flash-thinking and DeepSeek-R1.
Dataset: S1.1-QwQ
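For illustration, the regeneration step boils down to sampling one QwQ-32B trace per problem with vLLM. Below is a minimal sketch, where the dataset id (`simplescaling/s1K-1.1`), the `question` column name, and the sampling settings are our assumptions rather than the exact curation pipeline:

```python
# Minimal sketch of the trajectory-regeneration step. Assumptions (not the
# exact pipeline): dataset id, column name, and sampling settings.
from datasets import load_dataset
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

dataset = load_dataset("simplescaling/s1K-1.1", split="train")  # assumed source id
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
model = LLM("Qwen/QwQ-32B", tensor_parallel_size=8)

# Build one chat-formatted prompt per problem via QwQ's own chat template
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": ex["question"]}],  # "question" column assumed
        tokenize=False,
        add_generation_prompt=True,
    )
    for ex in dataset
]
outputs = model.generate(prompts, SamplingParams(temperature=0.6, top_p=0.95, max_tokens=32768))
trajectories = [o.outputs[0].text for o in outputs]  # one QwQ trace per problem
```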
Here we present the evaluation results of our S1.1-QwQ-DS-32B and S1.1-QwQ-Qwen-32B models on challenging reasoning tasks including AIME2024, AIME2025, MATH500, and GPQA-Diamond.
| Model | Model Size | AIME2024 | AIME2025 | MATH500 | GPQA |
|---|---|---|---|---|---|
| Qwen2.5-Instruct | 32B | 16.7 | 26.7 | 84.2 | 48.5 |
| +S1-1k (Gemini-2.0-flash-thinking) | 32B | 56.7 | 26.7 | 93.0 | 59.6 |
| +S1.1-32B (R1) | 32B | 56.7 | 60.0 | 95.4 | 63.6 |
| S1.1-QwQ-Qwen-32B (Ours) | 32B | 66.7 | 60.0 | 95.8 | 64.7 |
| S1.1-QwQ-DS-32B (Ours) | 32B | 83.3 | 73.3 | 96.4 | 66.7 |
Compared with other versions of the S1.1-1K dataset, our newly curated dataset delivers superior performance gains on top of both Qwen2.5-32B-Instruct and DeepSeek-R1-Distill-Qwen-32B across all benchmarks.
We also compare our results with more open-source reasoning LLMs:
| Category | Model | Model Size | AIME 2024 | AIME 2025 | MATH500 | GPQA |
|---|---|---|---|---|---|---|
| Industrial Models | QwQ | 32B | 80.0 | 60.0 | 97.6 | 68.2 |
| | DeepSeek-R1 | 671B | 79.8 | - | 97.3 | 71.5 |
| Open-Sourced Models | Qwen2.5-Instruct | 32B | 16.7 | 26.7 | 84.2 | 48.5 |
| | R1-Distill-Qwen2.5 | 7B | 50.0 | 40.0 | 92.6 | 47.0 |
| | R1-Distill-Qwen2.5 | 14B | 60.0 | 26.7 | 92.0 | 52.0 |
| | R1-Distill-Qwen2.5 | 32B | 70.0 | 46.7 | 92.0 | 59.6 |
| | OpenThinker | 32B | 63.3 | 46.7 | 94.8 | 60.1 |
| | FuseO1-Preview | 32B | 76.7 | 40.0 | 93.4 | 59.1 |
| | Tiny-R1 | 32B | 76.7 | 53.3 | 95.4 | - |
| | Light-R1 | 32B | 78.1 | 65.9 | 96.2 | 68.0 |
| | EXAONE-Deep | 32B | 70.0 | 60.0 | 96.2 | 64.6 |
| | LIMO | 32B | 56.7 | 33.3 | 92.2 | 58.8 |
| Our Model | S1.1-QwQ-DS | 32B | 83.3 | 73.3 | 96.4 | 66.7 |
We provide our full evaluation outputs in the `eval_result` folder.
## Quick start with vLLM
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = 'BitStarWalkin/S1.1-QwQ-DS'

# Shard the 32B model across 8 GPUs with tensor parallelism
model = LLM(
    model_id,
    tensor_parallel_size=8,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generous token budget so the reasoning trace is not truncated
sampling_params = SamplingParams(
    max_tokens=16384,
)

# Raw string so LaTeX backslashes (\begin, \frac, ...) reach the model intact
question = r"""Let \(x, y\), and \(z\) be positive real numbers satisfying the system of equations:
\[
\begin{array}{c}
\sqrt{2 x-x y}+\sqrt{2 y-x y}=1 \\
\sqrt{2 y-y z}+\sqrt{2 z-y z}=\sqrt{2} \\
\sqrt{2 z-z x}+\sqrt{2 x-z x}=\sqrt{3} .
\end{array}
\]
Then \(\left[(1-x)(1-y)(1-z)\right]^{2}\) can be written as \(\frac{m}{n}\), where \(m\) and \(n\) are relatively prime positive integers. Find \(m+n\)."""

# Wrap the question in the DeepSeek-R1 chat format
ds_prompt = "<|User|>\n" + question + "<|Assistant|>\n"

output = model.generate(ds_prompt, sampling_params=sampling_params)
print(output[0].outputs[0].text)
```
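Alternatively, the loaded tokenizer can build the same prompt without hard-coding the `<|User|>`/`<|Assistant|>` markers. A minimal sketch, assuming the model inherits DeepSeek-R1-Distill-Qwen-32B's default chat template:

```python
# Sketch: let the tokenizer's chat template format the prompt, assuming the
# model ships with DeepSeek-R1-Distill-Qwen-32B's default template.
chat_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,  # append the assistant-turn marker
)
output = model.generate(chat_prompt, sampling_params=sampling_params)
print(output[0].outputs[0].text)
```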
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- total_eval_batch_size: 64
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- num_epochs: 5.0
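For reference, the configuration above maps roughly onto the following `transformers.TrainingArguments`. This is an illustrative sketch only (the actual run used LLaMAFactory), and the output directory name is hypothetical:

```python
# Illustrative sketch: the hyperparameters above expressed as TrainingArguments.
# The actual training used LLaMAFactory; output_dir is a hypothetical name.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="s1.1-qwq-ds",        # hypothetical
    learning_rate=1e-5,
    per_device_train_batch_size=1,   # 8 GPUs x 1 x grad-accum 2 = 16 effective
    per_device_eval_batch_size=8,    # 8 GPUs x 8 = 64 effective
    gradient_accumulation_steps=2,
    num_train_epochs=5.0,
    lr_scheduler_type="cosine",
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```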
## Framework versions
- Transformers 4.49.0
- Pytorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0