Experiment Log on a lightweight Python pro coder
#1 · opened by burtenshaw
First MoE training run on wandb: https://wandb.ai/smartwithfood/huggingface/runs/m22i87x1/workspace?nw=nwuserbenjaminburtenshaw
## MoE Training
Completed the first few SFT runs on the MoE model's weights.
### Training
Run this command:

```shell
python trl/trl/scripts/sft.py --config recipes/config_000.yaml
```
This is the config:

```yaml
model_name_or_path: Qwen/Qwen3-30B-A3B
# dataset
dataset_name: burtenshaw/tulu-3-sft-personas-code-no-prompt
dataset_num_proc: 6
text_column: messages
eos_token: '<|im_end|>'
# training
learning_rate: 2.0e-5
num_train_epochs: 1
packing: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
gradient_checkpointing: true
logging_steps: 1
max_length: 2048
warmup_ratio: 0.03
lr_scheduler_type: 'cosine'
bf16: true
bf16_full_eval: true
fp16: false
```
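As a quick sanity check on the batch settings above, here is a minimal sketch that derives the effective batch size and packed tokens per optimizer step from the config values. The `num_devices` count is an assumption (the log does not state the hardware); everything else is copied from the config.

```python
# Values copied from recipes/config_000.yaml above.
config = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 2,
    "max_length": 2048,
    "packing": True,
}

num_devices = 8  # assumption: not stated in the log, adjust to your hardware

# Sequences contributing to one optimizer step across all devices.
sequences_per_step = (
    config["per_device_train_batch_size"]
    * config["gradient_accumulation_steps"]
    * num_devices
)

# With packing enabled, each sequence is filled to max_length,
# so tokens per optimizer step is sequences * max_length.
tokens_per_step = sequences_per_step * config["max_length"]

print(sequences_per_step)  # 16
print(tokens_per_step)     # 32768
```

With these settings each optimizer step sees roughly 32k packed tokens per 8-GPU step, which is why `gradient_checkpointing` is enabled to keep memory in check.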
### Evaluation
The results stay mostly within statistical noise on all benchmarks except LiveCodeBench (LCB). This makes sense because this dataset trains the model to stop thinking; that also decreased evaluation time by 35%.
| Task | Metric | Qwen/Qwen3-30B-A3B | config_000 |
|---|---|---|---|
| ARC Challenge | acc_norm | 0.3874 | 0.3848 |
| Hellaswag | acc_norm | 0.6483 | 0.6747 |
| MMLU (Average) | acc | 0.3271 | 0.3581 |
| Winogrande | acc | 0.5943 | 0.5975 |
| LCB Code Gen v4 | codegen_pass@1:16 | 0.3224 | 0.2269 |
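To make the claim concrete, a small sketch computing the per-task deltas from the table above shows that LiveCodeBench is the only benchmark with a large movement; all other tasks shift by about 3 points or less. The numbers are taken directly from the table.

```python
# (base, tuned) scores copied from the evaluation table above.
results = {
    "ARC Challenge":   (0.3874, 0.3848),
    "Hellaswag":       (0.6483, 0.6747),
    "MMLU (Average)":  (0.3271, 0.3581),
    "Winogrande":      (0.5943, 0.5975),
    "LCB Code Gen v4": (0.3224, 0.2269),
}

# Delta = tuned (config_000) minus base (Qwen/Qwen3-30B-A3B).
deltas = {task: round(tuned - base, 4) for task, (base, tuned) in results.items()}
for task, delta in deltas.items():
    print(f"{task}: {delta:+.4f}")

# The task with the largest absolute change.
largest = max(deltas, key=lambda t: abs(deltas[t]))
print(largest)  # LCB Code Gen v4
```

The LCB drop (about -9.6 points) is roughly three times larger than any other shift, consistent with the no-thinking dataset trading code-generation accuracy for faster evaluation.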