Overview
This experiment tests whether small-scale SFT on a 30B+ model can improve its performance on math and code tasks while preserving its general-purpose abilities.
Base Model
Qwen/Qwen3-32B (https://huggingface.co/Qwen/Qwen3-32B)
Data
Translation Model
Qwen/Qwen3-235B-A22B (https://huggingface.co/Qwen/Qwen3-235B-A22B), used to produce the Korean translations of the datasets below (see the translation sketch after the dataset list)
Datasets
HuggingFaceTB/smoltalk
- Purpose: Maintain general-purpose capabilities with minimal data
- Samples: 15k English originals + 15k Korean translations
- https://huggingface.co/datasets/HuggingFaceTB/smoltalk
LLM360/guru-RL-92k
- Domain: Math (composed of OR1, DAPO, DeepScaler)
- Samples: 1k English originals + 1k Korean translations
- https://huggingface.co/datasets/LLM360/guru-RL-92k
PrimeIntellect/SYNTHETIC-2-SFT-verified
- Domain: Math & Code
- Samples: 1k English originals + 1k Korean translations
- https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2-SFT-verified
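The card does not include the data-preparation script. As a rough sketch under assumptions (an OpenAI-compatible vLLM endpoint serving Qwen/Qwen3-235B-A22B, dataset config/split names, and the per-row text handling are all illustrative, not the original setup), the bilingual mix could be assembled like this:

```python
# Sketch only: sample the stated counts from each source and translate them to
# Korean with Qwen3-235B-A22B served behind an OpenAI-compatible endpoint
# (e.g. vLLM). Endpoint URL, config/split names, and prompt wording are assumptions.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

SOURCES = [
    ("HuggingFaceTB/smoltalk", "all", 15_000),                 # general chat; config name assumed
    ("LLM360/guru-RL-92k", None, 1_000),                       # math (OR1, DAPO, DeepScaler)
    ("PrimeIntellect/SYNTHETIC-2-SFT-verified", None, 1_000),  # math & code
]

def translate_to_korean(text: str) -> str:
    """Ask the translation model for a Korean rendering of one sample."""
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B",
        messages=[
            {"role": "system", "content": "Translate the user's text into Korean. Return only the translation."},
            {"role": "user", "content": text},
        ],
        temperature=0.0,
    )
    return resp.choices[0].message.content

mixed = []
for repo, config, n in SOURCES:
    ds = load_dataset(repo, config, split="train").shuffle(seed=1234).select(range(n))
    for row in ds:
        text = str(row)  # placeholder: each source has its own field layout
        mixed.append({"lang": "en", "text": text})
        mixed.append({"lang": "ko", "text": translate_to_korean(text)})
```

In practice each source stores conversations differently (e.g. message lists vs. prompt/response pairs), so the flattening step above would need per-dataset logic before translation.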
Train
Hardware
2 nodes × 8 H100 GPUs each (16 × H100 total)
Pipeline
- DeepSpeed-Chat (https://github.com/deepspeedai/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat)
Arguments
- DeepSpeed ZeRO Stage 3
- Batch size: 1
- Gradient accumulation steps: 16
- Max sequence length: 10,246 tokens
- Learning rate: 9.65 × 10⁻⁶
- LR scheduler: cosine
- Warmup steps: 500
- Seed: 1234
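The exact DeepSpeed-Chat launch command is not given in the card. The sketch below mirrors the listed hyperparameters using the Hugging Face Trainer with a ZeRO-3 DeepSpeed config as a stand-in; the ds_zero3.json filename, bf16 precision, and placeholder training data are assumptions, not the original pipeline.

```python
# Minimal SFT sketch with the hyperparameters listed above (HF Trainer + DeepSpeed
# ZeRO-3 as a stand-in for DeepSpeed-Chat; precision, config file, and data are assumed).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Qwen/Qwen3-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder corpus; the real run uses the bilingual mix described in the Data section.
texts = ["Question: 1 + 1 = ?\nAnswer: 2"]

def tokenize(batch):
    # Truncate to the listed 10,246-token limit.
    return tokenizer(batch["text"], truncation=True, max_length=10246)

train_ds = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"]
)

args = TrainingArguments(
    output_dir="qwen3-32b-sft",
    per_device_train_batch_size=1,    # batch size 1
    gradient_accumulation_steps=16,
    learning_rate=9.65e-6,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    seed=1234,
    bf16=True,                        # precision is an assumption; not stated in the card
    deepspeed="ds_zero3.json",        # ZeRO Stage 3 config file (contents assumed)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

With 16 H100s, per-device batch size 1 and 16 gradient-accumulation steps give an effective global batch of 256 sequences per optimizer step.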
Evaluation
Pipeline
- Framework: OpenCompass
- Execution: OpenCompass's built-in vLLM inference pipeline, run from the GitHub source
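A minimal OpenCompass model config using its built-in vLLM wrapper might look like the sketch below. Class and field names follow recent OpenCompass releases and may differ from the version actually used; the checkpoint path, tensor-parallel size, and generation budget are assumptions.

```python
# OpenCompass model-config sketch for evaluating the SFT checkpoint with the
# built-in vLLM wrapper. The benchmark configs (ARC-c, BBH, GSM8K, MMLU, NQ)
# would be pulled in via OpenCompass's usual `with read_base()` imports (omitted).
from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr="qwen3-32b-sft",
        path="qwen3-32b-sft",                       # local checkpoint path (assumed)
        model_kwargs=dict(tensor_parallel_size=8),  # one 8-GPU node per model instance (assumed)
        max_seq_len=10246,
        max_out_len=2048,                           # assumed generation budget
        batch_size=16,
        run_cfg=dict(num_gpus=8),
    )
]
```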
Result
| Benchmark | Base Model (Qwen3-32B) | Fine-tuned (SFT) |
|---|---|---|
| ARC-c | 55.59 | 50.17 |
| BBH | 79.90 | 51.85 |
| GSM8K | 92.87 | 69.07 |
| MMLU | 85.70 | 73.79 |
| NQ | 10.39 | 11.99 |
Limitation
NQ was the only benchmark to show a performance gain; every other benchmark regressed after SFT.
The fine-tuned model is not recommended for deployment in production services that rely heavily on broad world-knowledge abilities.