Overview
This experiment tests whether small-scale SFT on a 30B+ model can improve its performance on math and code tasks while preserving its general-purpose abilities.
Base Model
Qwen/Qwen3-32B (https://huggingface.co/Qwen/Qwen3-32B)
Data
Translation Model
Qwen/Qwen3-235B-A22B (https://huggingface.co/Qwen/Qwen3-235B-A22B), used to produce the Korean translations of the datasets below (see the translation sketch after the dataset list)
Datasets
HuggingFaceTB/smoltalk
- Purpose: Maintain general-purpose capabilities with minimal data
- Samples: 15k English originals + 15k Korean translations
- https://huggingface.co/datasets/HuggingFaceTB/smoltalk
LLM360/guru-RL-92k
- Domain: Math (composed of OR1, DAPO, DeepScaler)
- Samples: 1k English originals + 1k Korean translations
- https://huggingface.co/datasets/LLM360/guru-RL-92k
PrimeIntellect/SYNTHETIC-2-SFT-verified
- Domain: Math & Code
- Samples: 1k English originals + 1k Korean translations
- https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2-SFT-verified
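The card does not include the data-preparation script. As a rough sketch under assumptions (an OpenAI-compatible vLLM endpoint serving Qwen/Qwen3-235B-A22B, dataset config/split names, and the per-row text handling are all illustrative, not the original setup), the bilingual mix could be assembled like this:

```python
# Sketch only: sample the stated counts from each source and translate them to
# Korean with Qwen3-235B-A22B served behind an OpenAI-compatible endpoint
# (e.g. vLLM). Endpoint URL, config/split names, and prompt wording are assumptions.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

SOURCES = [
    ("HuggingFaceTB/smoltalk", "all", 15_000),                 # general chat; config name assumed
    ("LLM360/guru-RL-92k", None, 1_000),                       # math (OR1, DAPO, DeepScaler)
    ("PrimeIntellect/SYNTHETIC-2-SFT-verified", None, 1_000),  # math & code
]

def translate_to_korean(text: str) -> str:
    """Ask the translation model for a Korean rendering of one sample."""
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B",
        messages=[
            {"role": "system", "content": "Translate the user's text into Korean. Return only the translation."},
            {"role": "user", "content": text},
        ],
        temperature=0.0,
    )
    return resp.choices[0].message.content

mixed = []
for repo, config, n in SOURCES:
    ds = load_dataset(repo, config, split="train").shuffle(seed=1234).select(range(n))
    for row in ds:
        text = str(row)  # placeholder: each source has its own field layout
        mixed.append({"lang": "en", "text": text})
        mixed.append({"lang": "ko", "text": translate_to_korean(text)})
```

In practice each source stores conversations differently (e.g. message lists vs. prompt/response pairs), so the flattening step above would need per-dataset logic before translation.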
Train
Hardware
2 nodes × 8 H100 GPUs each (16 × H100 total)
Pipeline
- DeepSpeed-Chat (https://github.com/deepspeedai/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat)
Arguments
- DeepSpeed ZeRO Stage 3
- Batch size: 1
- Gradient accumulation steps: 16
- Max sequence length: 10,246 tokens
- Learning rate: 9.65 × 10⁻⁶
- LR scheduler: cosine
- Warmup steps: 500
- Seed: 1234
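The exact DeepSpeed-Chat launch command is not given in the card. The sketch below mirrors the listed hyperparameters using the Hugging Face Trainer with a ZeRO-3 DeepSpeed config as a stand-in; the ds_zero3.json filename, bf16 precision, and placeholder training data are assumptions, not the original pipeline.

```python
# Minimal SFT sketch with the hyperparameters listed above (HF Trainer + DeepSpeed
# ZeRO-3 as a stand-in for DeepSpeed-Chat; precision, config file, and data are assumed).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Qwen/Qwen3-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder corpus; the real run uses the bilingual mix described in the Data section.
texts = ["Question: 1 + 1 = ?\nAnswer: 2"]

def tokenize(batch):
    # Truncate to the listed 10,246-token limit.
    return tokenizer(batch["text"], truncation=True, max_length=10246)

train_ds = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"]
)

args = TrainingArguments(
    output_dir="qwen3-32b-sft",
    per_device_train_batch_size=1,    # batch size 1
    gradient_accumulation_steps=16,
    learning_rate=9.65e-6,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    seed=1234,
    bf16=True,                        # precision is an assumption; not stated in the card
    deepspeed="ds_zero3.json",        # ZeRO Stage 3 config file (contents assumed)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

With 16 H100s, per-device batch size 1 and 16 gradient-accumulation steps give an effective global batch of 256 sequences per optimizer step.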
Evaluation
Pipeline
- Framework: OpenCompass
- Execution: OpenCompass's built-in vLLM inference pipeline, run from the GitHub source
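A minimal OpenCompass model config using its built-in vLLM wrapper might look like the sketch below. Class and field names follow recent OpenCompass releases and may differ from the version actually used; the checkpoint path, tensor-parallel size, and generation budget are assumptions.

```python
# OpenCompass model-config sketch for evaluating the SFT checkpoint with the
# built-in vLLM wrapper. The benchmark configs (ARC-c, BBH, GSM8K, MMLU, NQ)
# would be pulled in via OpenCompass's usual `with read_base()` imports (omitted).
from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr="qwen3-32b-sft",
        path="qwen3-32b-sft",                       # local checkpoint path (assumed)
        model_kwargs=dict(tensor_parallel_size=8),  # one 8-GPU node per model instance (assumed)
        max_seq_len=10246,
        max_out_len=2048,                           # assumed generation budget
        batch_size=16,
        run_cfg=dict(num_gpus=8),
    )
]
```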
Result
| Benchmark | Base Model (Qwen3-32B) | Fine-tuned (SFT) |
|---|---|---|
| ARC-c | 55.59 | 50.17 |
| BBH | 79.90 | 51.85 |
| GSM8K | 92.87 | 69.07 |
| MMLU | 85.70 | 73.79 |
| NQ | 10.39 | 11.99 |
Limitation
NQ was the only benchmark to show a performance gain; every other benchmark regressed after SFT.
The fine-tuned model is not recommended for deployment in production services that rely heavily on broad world-knowledge abilities.