# MNLP M2 DPO Model: Qwen3-0.6B Fine-Tuned with Direct Preference Optimization
This repository contains a Direct Preference Optimization (DPO) model built on top of a supervised fine-tuned version of Qwen/Qwen3-0.6B-Base, developed as part of the MNLP M2 project. The model is fine-tuned on a high-quality preference dataset to better align its responses with human preferences.
## Model Description

- Base Model: Qwen/Qwen3-0.6B-Base
- SFT Checkpoint: Tandogan/MNLP_M2_SFT
- DPO Dataset: Tandogan/MNLP_M2_dpo_dataset
- Libraries: Unsloth, TRL
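For reference, the preference data can be pulled straight from the Hub with the `datasets` library. The sketch below only downloads and inspects it; it makes no assumptions about split or column names.

```python
from datasets import load_dataset

# Download the preference dataset used for DPO training; this returns a
# DatasetDict keyed by whatever splits the dataset defines.
dpo_data = load_dataset("Tandogan/MNLP_M2_dpo_dataset")
print(dpo_data)  # inspect splits, column names, and sizes
```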
## Training Procedure

### Supervised Fine-Tuning (SFT)

- Dataset: Tandogan/sft_dataset_final_train (Alpaca-style prompt–completion pairs)
- Max sequence length: 2048
- Epochs: 4
- Optimizer: AdamW (learning rate = 3e-5, weight decay = 0)
- Precision: bf16
- Batch size: 2 (gradient accumulation = 4)
- Scheduler: Linear with 1% warmup
- Eval & Checkpointing: Every epoch
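The SFT hyperparameters above map onto a plain TRL `SFTTrainer` run roughly as sketched below. This is an illustrative sketch, not the exact training script: the actual training used Unsloth on top of TRL, the output directory and split names are assumptions, and some argument names (e.g. `max_seq_length`, `eval_strategy`) differ slightly across TRL/transformers versions.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Split names are assumed; the dataset is expected to already be in a format
# SFTTrainer understands (e.g. prompt/completion columns or a "text" column).
dataset = load_dataset("Tandogan/sft_dataset_final_train")

sft_args = SFTConfig(
    output_dir="qwen3-0.6b-sft",       # hypothetical output path
    num_train_epochs=4,
    learning_rate=3e-5,                # AdamW is the default optimizer
    weight_decay=0.0,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="linear",
    warmup_ratio=0.01,                 # 1% warmup
    bf16=True,
    max_seq_length=2048,               # called `max_length` in newer TRL releases
    eval_strategy="epoch",
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B-Base",      # model id; TRL loads model and tokenizer
    args=sft_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # validation split name assumed
)
trainer.train()
```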
### Direct Preference Optimization (DPO)

Two DPO fine-tuning experiments were run:

1. From the base model (Qwen/Qwen3-0.6B-Base)
2. From the SFT model (Tandogan/MNLP_M2_SFT)

- Dataset: Tandogan/MNLP_M2_dpo_dataset
- Max sequence length: 2048 (prompt and completions truncated to 1024 each)
- Epochs: 4
- Optimizer: AdamW (learning rate = 2e-6, weight decay = 0)
- Precision: bf16
- Batch size: 2 (gradient accumulation = 4)
- Scheduler: Cosine with 1% warmup
- DPO Beta: 0.1
- Eval & Checkpointing: Every epoch
- Monitoring: Weights & Biases (WandB)
- Best Epoch Selection: Based on validation loss
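Likewise, the DPO settings correspond roughly to the TRL `DPOTrainer` configuration sketched below, shown here starting from the SFT checkpoint (the second of the two experiments). Again this is an assumption-laden sketch rather than the exact script: the output directory and split names are invented, the actual runs used Unsloth on top of TRL, and the `processing_class` argument was called `tokenizer` in older TRL releases.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Tandogan/MNLP_M2_SFT"   # or "Qwen/Qwen3-0.6B-Base" for the other run
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Split names assumed; DPOTrainer expects prompt/chosen/rejected-style columns.
dataset = load_dataset("Tandogan/MNLP_M2_dpo_dataset")

dpo_args = DPOConfig(
    output_dir="qwen3-0.6b-dpo",       # hypothetical output path
    beta=0.1,
    num_train_epochs=4,
    learning_rate=2e-6,                # AdamW is the default optimizer
    weight_decay=0.0,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,                 # 1% warmup
    bf16=True,
    max_length=2048,
    max_prompt_length=1024,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # keep the best epoch by validation loss
    report_to="wandb",                 # log to Weights & Biases
)

trainer = DPOTrainer(
    model=model,                        # no ref_model passed: TRL uses a frozen
    args=dpo_args,                      # copy of `model` as the DPO reference
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # validation split name assumed
    processing_class=tokenizer,          # `tokenizer=` on older TRL versions
)
trainer.train()
```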
## Intended Use
This model is intended for research and experimentation with preference-based alignment and reward modeling. It is not production-ready and may produce hallucinated, biased, or unsafe outputs. Please evaluate carefully for downstream tasks.
## How to Use

You can use the model with the `transformers` and `trl` libraries for inference or evaluation:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Tandogan/MNLP_M2_dpo_model").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M2_dpo_model")

prompt = "Explain recursion in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```