Overview

This experiment tests whether small-scale SFT on a 30B+ model can improve its performance on math and code tasks while preserving its general-purpose abilities.

Base Model

Qwen/Qwen3-32B (https://huggingface.co/Qwen/Qwen3-32B)
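
As a reference point, a minimal sketch of loading the base checkpoint with Transformers; BF16 and automatic device placement are reasonable defaults, not settings stated in this card:

```python
# Hedged sketch: load the Qwen/Qwen3-32B base checkpoint with Transformers.
# BF16 and device_map="auto" are reasonable defaults, not settings taken
# from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```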

Data

Translation Model

Qwen/Qwen3-235B-A22B (https://huggingface.co/Qwen/Qwen3-235B-A22B)
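
The card does not describe how the translation model was applied to the data. Assuming it was served behind an OpenAI-compatible endpoint (for example a vLLM server) and prompted to translate dataset text, a minimal sketch could look like the following; the endpoint URL, prompt, and target language are illustrative assumptions:

```python
# Hedged sketch: the card does not say how Qwen3-235B-A22B was used.
# This assumes it was served behind an OpenAI-compatible endpoint
# (e.g. a vLLM server) and prompted to translate dataset text; the URL,
# prompt, and target language are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def translate(text: str, target_lang: str) -> str:
    # The target language is not stated in the card; pass it explicitly.
    response = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B",
        messages=[
            {
                "role": "system",
                "content": f"Translate the user's text into {target_lang}. "
                           "Return only the translation.",
            },
            {"role": "user", "content": text},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content
```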

Datasets

  1. HuggingFaceTB/smoltalk (https://huggingface.co/datasets/HuggingFaceTB/smoltalk)

  2. LLM360/guru-RL-92k (https://huggingface.co/datasets/LLM360/guru-RL-92k)

  3. PrimeIntellect/SYNTHETIC-2-SFT-verified (https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2-SFT-verified)
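
A hedged sketch of pulling these three corpora with the Hugging Face datasets library; the config and split names are assumptions, and the card does not state which subsets or mixing ratios were used:

```python
# Hedged sketch: load the three SFT corpora with the Hugging Face
# datasets library. Config and split names are assumptions (e.g. the
# "all" config for smoltalk); the card does not state which subsets or
# mixing ratios were used.
from datasets import load_dataset

smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
guru_rl = load_dataset("LLM360/guru-RL-92k", split="train")
synthetic2 = load_dataset("PrimeIntellect/SYNTHETIC-2-SFT-verified", split="train")

print(len(smoltalk), len(guru_rl), len(synthetic2))
```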

Train

Hardware

2 nodes × 8 H100 GPUs each (16 × H100 total)

Pipeline

Arguments

  • DeepSpeed ZeRO-3
  • Batch size: 1
  • Gradient accumulation steps: 16
  • Max sequence length: 10,246 tokens
  • Learning rate: 9.65 × 10⁻⁶
  • LR scheduler: cosine
  • Warmup steps: 500
  • Seed: 1234
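
The arguments above map fairly directly onto a Hugging Face TrainingArguments object plus a ZeRO-3 DeepSpeed config. The card does not name its training framework, so the sketch below is an assumption about tooling; only the numeric values come from the list:

```python
# Hedged sketch: map the listed hyperparameters onto
# transformers.TrainingArguments with a minimal DeepSpeed ZeRO-3 config.
# The training framework is not named in the card; output paths and the
# "auto" fields are assumptions.
import json
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {"stage": 3},        # DeepSpeed ZeRO 3
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
with open("ds_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)

training_args = TrainingArguments(
    output_dir="qwen3-32b-sft",         # hypothetical output path
    per_device_train_batch_size=1,      # Batch size: 1
    gradient_accumulation_steps=16,     # Gradient accumulation steps: 16
    learning_rate=9.65e-6,              # Learning rate: 9.65e-6
    lr_scheduler_type="cosine",         # LR scheduler: cosine
    warmup_steps=500,                   # Warmup steps: 500
    seed=1234,                          # Seed: 1234
    bf16=True,
    deepspeed="ds_zero3.json",          # DeepSpeed ZeRO 3 (file name assumed)
)
# The 10,246-token max sequence length is applied at tokenization or
# packing time (e.g. max_seq_length in an SFT trainer), not here.
```

On the 2-node × 8-GPU setup, this would typically be launched with a multi-node launcher such as the DeepSpeed launcher or torchrun; the exact launch command is not given in the card.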

Evaluation

Pipeline

  • Framework: OpenCompass
  • Execution: OpenCompass's built-in vLLM pipeline, run from the GitHub repository (a hedged config sketch follows)
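
A hedged sketch of what the model entry in an OpenCompass configuration could look like when evaluating through its vLLM backend; the checkpoint path, batch size, parallelism, and output-length values are assumptions, and dataset config imports are omitted because they vary across OpenCompass versions:

```python
# Hedged sketch of an OpenCompass model entry using its vLLM backend.
# The checkpoint path, batch size, tensor parallelism, and output length
# are assumptions; dataset config imports are omitted because their
# module paths vary across OpenCompass versions.
from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr="qwen3-32b-sft",
        path="path/to/qwen3-32b-sft",              # hypothetical local checkpoint
        model_kwargs=dict(tensor_parallel_size=8),
        max_seq_len=10246,                          # matches the training max length
        max_out_len=1024,
        batch_size=16,
        run_cfg=dict(num_gpus=8),
    )
]
```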

Result

Benchmark    Base Model (Qwen3-32B)    Fine-tuned (SFT)
ARC-c        55.59                     50.17
BBH          79.90                     51.85
GSM8K        92.87                     69.07
MMLU         85.70                     73.79
NQ           10.39                     11.99

Limitation

NQ was the only benchmark to show a performance gain (10.39 → 11.99); every other benchmark regressed after SFT.

The fine-tuned model is not recommended for deployment in production services that rely heavily on broad world-knowledge abilities.
