S1.1-QwQ-DS

This model is a fine-tuned version of deepseek-ai/DeepSeek-R1-Distill-Qwen-32B on the S1.1-QwQ dataset.

The model achieves state-of-the-art reasoning performance on challenging benchmarks including AIME 2024/2025, MATH500, and GPQA-Diamond.

Training and evaluation data

We use LLaMA-Factory on 8× A100-SXM4-80GB GPUs to conduct full-parameter fine-tuning on our self-curated S1.1-QwQ dataset, a refined version of the S1.1-1K dataset.

We use QwQ-32B to generate reasoning trajectories for each problem in the S1.1-1K dataset. Our experiments show that the QwQ-generated trajectories are of higher quality than the original ones produced by Gemini-2.0-flash-thinking and DeepSeek-R1.
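For reference, trajectory generation of this kind can be sketched as below. This is an illustrative sketch, not our exact curation script: it assumes the public simplescaling/s1K-1.1 release with a question column, and the sampling settings are placeholder assumptions.

```python
# Illustrative sketch: regenerating reasoning trajectories with QwQ-32B via vLLM.
# Dataset ID and the "question" column follow the public s1K-1.1 release;
# the sampling settings are assumptions, not the exact recipe used here.
from datasets import load_dataset
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

ds = load_dataset("simplescaling/s1K-1.1", split="train")
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
qwq = LLM("Qwen/QwQ-32B", tensor_parallel_size=8)

# Build one chat-formatted prompt per problem using QwQ's own chat template.
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for q in ds["question"]
]

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=16384)
outputs = qwq.generate(prompts, sampling_params=params)
trajectories = [o.outputs[0].text for o in outputs]
```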

Dataset: S1.1-QwQ

Here we present the evaluation results of our S1.1-QwQ-DS-32B and S1.1-QwQ-Qwen-32B models on challenging reasoning tasks including AIME 2024, AIME 2025, MATH500, and GPQA-Diamond.

| Model | Model Size | AIME 2024 | AIME 2025 | MATH500 | GPQA |
|---|---|---|---|---|---|
| Qwen2.5-Instruct | 32B | 16.7 | 26.7 | 84.2 | 48.5 |
| +S1-1K (Gemini-2.0-flash-thinking) | 32B | 56.7 | 26.7 | 93.0 | 59.6 |
| +S1.1-32B (R1) | 32B | 56.7 | 60.0 | 95.4 | 63.6 |
| S1.1-QwQ-Qwen-32B (Ours) | 32B | 66.7 | 60.0 | 95.8 | 64.7 |
| S1.1-QwQ-DS-32B (Ours) | 32B | 83.3 | 73.3 | 96.4 | 66.7 |

Compared to other versions of the S1.1-1K dataset, our newly curated dataset delivers superior performance gains on both Qwen2.5-32B-Instruct and DeepSeek-R1-Distill-Qwen-32B across all benchmarks.

We also compare our results with more open-source reasoning LLMs:

| Category | Model | Model Size | AIME 2024 | AIME 2025 | MATH500 | GPQA |
|---|---|---|---|---|---|---|
| Industrial Models | QwQ | 32B | 80.0 | 60.0 | 97.6 | 68.2 |
| | DeepSeek-R1 | 671B | 79.8 | - | 97.3 | 71.5 |
| Open-Sourced Models | Qwen2.5-Instruct | 32B | 16.7 | 26.7 | 84.2 | 48.5 |
| | R1-Distill-Qwen2.5 | 7B | 50.0 | 40.0 | 92.6 | 47.0 |
| | R1-Distill-Qwen2.5 | 14B | 60.0 | 26.7 | 92.0 | 52.0 |
| | R1-Distill-Qwen2.5 | 32B | 70.0 | 46.7 | 92.0 | 59.6 |
| | OpenThinker | 32B | 63.3 | 46.7 | 94.8 | 60.1 |
| | FuseO1-Preview | 32B | 76.7 | 40.0 | 93.4 | 59.1 |
| | Tiny-R1 | 32B | 76.7 | 53.3 | 95.4 | - |
| | Light-R1 | 32B | 78.1 | 65.9 | 96.2 | 68.0 |
| | EXAONE-Deep | 32B | 70.0 | 60.0 | 96.2 | 64.6 |
| | LIMO | 32B | 56.7 | 33.3 | 92.2 | 58.8 |
| Our Model | S1.1-QwQ-DS | 32B | 83.3 | 73.3 | 96.4 | 66.7 |

We provide our full evaluation results in the `eval_result` folder of this repository.

Quick start with vLLM

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "BitStarWalkin/S1.1-QwQ-DS"

# Shard the 32B model across 8 GPUs with tensor parallelism.
model = LLM(
    model_id,
    tensor_parallel_size=8,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Long reasoning traces need a generous generation budget.
sampling_params = SamplingParams(
    max_tokens=16384,
)

# Raw string so the LaTeX backslashes are kept verbatim.
question = r"""Let \(x, y\), and \(z\) be positive real numbers satisfying the system of equations:
\[
\begin{array}{c}
\sqrt{2 x-x y}+\sqrt{2 y-x y}=1 \\
\sqrt{2 y-y z}+\sqrt{2 z-y z}=\sqrt{2} \\
\sqrt{2 z-z x}+\sqrt{2 x-z x}=\sqrt{3} .
\end{array}
\]
Then \(\left[(1-x)(1-y)(1-z)\right]^{2}\) can be written as \(\frac{m}{n}\), where \(m\) and \(n\) are relatively prime positive integers. Find \(m+n\)."""

# DeepSeek-R1 chat format: user turn followed by the assistant tag.
ds_prompt = "<|User|>\n" + question + "<|Assistant|>\n"
output = model.generate(ds_prompt, sampling_params=sampling_params)
print(output[0].outputs[0].text)
```
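The hand-written `<|User|>`/`<|Assistant|>` markers above follow the DeepSeek-R1 chat format; equivalently, the prompt can be built with the tokenizer's bundled chat template. A minimal sketch reusing the objects defined above:

```python
# Equivalent prompt construction via the model's built-in chat template,
# instead of hand-writing the "<|User|>"/"<|Assistant|>" markers.
messages = [{"role": "user", "content": question}]
ds_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
output = model.generate(ds_prompt, sampling_params=sampling_params)
print(output[0].outputs[0].text)
```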

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 1
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 16
  • total_eval_batch_size: 64
  • optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • num_epochs: 5.0
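For readers reproducing the run outside LLaMA-Factory, these settings map onto Hugging Face TrainingArguments roughly as follows. This is a sketch under stated assumptions: output_dir and the bf16 flag are guesses, while the remaining values mirror the list above.

```python
from transformers import TrainingArguments

# Sketch of the run configuration above as Hugging Face TrainingArguments.
# output_dir and bf16 are assumptions; the rest mirrors the listed values.
args = TrainingArguments(
    output_dir="s1.1-qwq-ds",          # assumed
    learning_rate=1e-5,
    per_device_train_batch_size=1,     # train_batch_size per GPU
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",
    num_train_epochs=5.0,
    seed=42,
    optim="adamw_torch",
    bf16=True,                         # assumed from the BF16 checkpoint
)
# Effective batch sizes with 8 GPUs:
#   train: 1 (per device) * 2 (grad accum) * 8 (GPUs) = 16
#   eval:  8 (per device) * 8 (GPUs) = 64
```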

Framework versions

  • Transformers 4.49.0
  • Pytorch 2.5.1+cu124
  • Datasets 3.2.0
  • Tokenizers 0.21.0