# S1.1-QwQ-DS
This model is a fine-tuned version of deepseek-ai/DeepSeek-R1-Distill-Qwen-32B on the S1.1-QwQ dataset.
The model achieves state-of-the-art reasoning performance on challenging benchmarks including AIME2024/2025, MATH500, and GPQA-Diamond.
## Training and evaluation data
We use LLaMAFactory on 8× NVIDIA A100-SXM4-80GB GPUs to run full-parameter fine-tuning on our self-curated S1.1-QwQ dataset, a refined version of the S1.1-1K dataset.
We use QwQ-32B to generate a reasoning trajectory for each problem in the S1.1-1K dataset (a sketch of this step is given below). In our experiments, the QwQ-generated trajectories are of higher quality than the original ones produced by Gemini-2.0-flash-thinking and DeepSeek-R1.
Dataset: S1.1-QwQ
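For illustration, the regeneration step boils down to sampling one QwQ-32B trace per problem with vLLM. Below is a minimal sketch, where the dataset id (`simplescaling/s1K-1.1`), the `question` column name, and the sampling settings are our assumptions rather than the exact curation pipeline:

```python
# Minimal sketch of the trajectory-regeneration step. Assumptions (not the
# exact pipeline): dataset id, column name, and sampling settings.
from datasets import load_dataset
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

dataset = load_dataset("simplescaling/s1K-1.1", split="train")  # assumed source id
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
model = LLM("Qwen/QwQ-32B", tensor_parallel_size=8)

# Build one chat-formatted prompt per problem via QwQ's own chat template
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": ex["question"]}],  # "question" column assumed
        tokenize=False,
        add_generation_prompt=True,
    )
    for ex in dataset
]
outputs = model.generate(prompts, SamplingParams(temperature=0.6, top_p=0.95, max_tokens=32768))
trajectories = [o.outputs[0].text for o in outputs]  # one QwQ trace per problem
```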
Here we present the evaluation results of our S1.1-QwQ-DS-32B and S1.1-QwQ-Qwen-32B models on challenging reasoning tasks including AIME2024, AIME2025, MATH500, and GPQA-Diamond.
| Model | Model Size | AIME2024 | AIME2025 | MATH500 | GPQA |
|---|---|---|---|---|---|
| Qwen2.5-Instruct | 32B | 16.7 | 26.7 | 84.2 | 48.5 |
| +S1-1k (Gemini-2.0-flash-thinking) | 32B | 56.7 | 26.7 | 93.0 | 59.6 |
| +S1.1-32B (R1) | 32B | 56.7 | 60.0 | 95.4 | 63.6 |
| S1.1-QwQ-Qwen-32B (Ours) | 32B | 66.7 | 60.0 | 95.8 | 64.7 |
| S1.1-QwQ-DS-32B (Ours) | 32B | 83.3 | 73.3 | 96.4 | 66.7 |
Compared with other versions of the S1.1-1K dataset, our newly curated dataset delivers superior performance gains on top of both Qwen2.5-32B-Instruct and DeepSeek-R1-Distill-Qwen-32B across all benchmarks.
We also compare our results with more open-source reasoning LLMs:
| Category | Model | Model Size | AIME 2024 | AIME 2025 | MATH500 | GPQA |
|---|---|---|---|---|---|---|
| Industrial Models | QwQ | 32B | 80.0 | 60.0 | 97.6 | 68.2 |
| | DeepSeek-R1 | 671B | 79.8 | - | 97.3 | 71.5 |
| Open-Sourced Models | Qwen2.5-Instruct | 32B | 16.7 | 26.7 | 84.2 | 48.5 |
| | R1-Distill-Qwen2.5 | 7B | 50.0 | 40.0 | 92.6 | 47.0 |
| | R1-Distill-Qwen2.5 | 14B | 60.0 | 26.7 | 92.0 | 52.0 |
| | R1-Distill-Qwen2.5 | 32B | 70.0 | 46.7 | 92.0 | 59.6 |
| | OpenThinker | 32B | 63.3 | 46.7 | 94.8 | 60.1 |
| | FuseO1-Preview | 32B | 76.7 | 40.0 | 93.4 | 59.1 |
| | Tiny-R1 | 32B | 76.7 | 53.3 | 95.4 | - |
| | Light-R1 | 32B | 78.1 | 65.9 | 96.2 | 68.0 |
| | EXAONE-Deep | 32B | 70.0 | 60.0 | 96.2 | 64.6 |
| | LIMO | 32B | 56.7 | 33.3 | 92.2 | 58.8 |
| Our Model | S1.1-QwQ-DS | 32B | 83.3 | 73.3 | 96.4 | 66.7 |
We provide our full evaluation outputs in the `eval_result` folder.
## Quick start with vLLM
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = 'BitStarWalkin/S1.1-QwQ-DS'

# Shard the 32B model across 8 GPUs with tensor parallelism
model = LLM(
    model_id,
    tensor_parallel_size=8,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generous token budget so the reasoning trace is not truncated
sampling_params = SamplingParams(
    max_tokens=16384,
)

# Raw string so LaTeX backslashes (\begin, \frac, ...) reach the model intact
question = r"""Let \(x, y\), and \(z\) be positive real numbers satisfying the system of equations:
\[
\begin{array}{c}
\sqrt{2 x-x y}+\sqrt{2 y-x y}=1 \\
\sqrt{2 y-y z}+\sqrt{2 z-y z}=\sqrt{2} \\
\sqrt{2 z-z x}+\sqrt{2 x-z x}=\sqrt{3} .
\end{array}
\]
Then \(\left[(1-x)(1-y)(1-z)\right]^{2}\) can be written as \(\frac{m}{n}\), where \(m\) and \(n\) are relatively prime positive integers. Find \(m+n\)."""

# Wrap the question in the DeepSeek-R1 chat format
ds_prompt = "<|User|>\n" + question + "<|Assistant|>\n"

output = model.generate(ds_prompt, sampling_params=sampling_params)
print(output[0].outputs[0].text)
```
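Alternatively, the loaded tokenizer can build the same prompt without hard-coding the `<|User|>`/`<|Assistant|>` markers. A minimal sketch, assuming the model inherits DeepSeek-R1-Distill-Qwen-32B's default chat template:

```python
# Sketch: let the tokenizer's chat template format the prompt, assuming the
# model ships with DeepSeek-R1-Distill-Qwen-32B's default template.
chat_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,  # append the assistant-turn marker
)
output = model.generate(chat_prompt, sampling_params=sampling_params)
print(output[0].outputs[0].text)
```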
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- total_eval_batch_size: 64
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- num_epochs: 5.0
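For reference, the configuration above maps roughly onto the following `transformers.TrainingArguments`. This is an illustrative sketch only (the actual run used LLaMAFactory), and the output directory name is hypothetical:

```python
# Illustrative sketch: the hyperparameters above expressed as TrainingArguments.
# The actual training used LLaMAFactory; output_dir is a hypothetical name.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="s1.1-qwq-ds",        # hypothetical
    learning_rate=1e-5,
    per_device_train_batch_size=1,   # 8 GPUs x 1 x grad-accum 2 = 16 effective
    per_device_eval_batch_size=8,    # 8 GPUs x 8 = 64 effective
    gradient_accumulation_steps=2,
    num_train_epochs=5.0,
    lr_scheduler_type="cosine",
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```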
## Framework versions
- Transformers 4.49.0
- Pytorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0