Llama-3.2-1B Iterative DPO (Self-Rewarding)
This model was trained with iterative DPO following the Self-Rewarding Language Models approach, in which the model acts as its own judge so that it can keep improving over multiple iterations.
Training Details
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Training Method: Iterative DPO with Self-Rewarding
- Number of Iterations: 2
- Initial Dataset: 15 LLM Judge preference pairs
- Iteration 1 Dataset: 25 total pairs (15 initial + 10 self-judged)
- Iteration 2 Dataset: 33 total pairs (25 + 8 self-judged)
- LoRA Configuration: r=16, alpha=32
- Learning Rate: 3e-5 (used for each DPO iteration; see the configuration sketch below)
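As a rough illustration, the hyperparameters above map onto a `peft` LoRA configuration and `trl` DPO arguments along the following lines. This is a minimal sketch under assumptions: the dropout, target modules, beta, batch size, epoch count, and output directory are not reported for this model and are placeholders.

```python
from peft import LoraConfig
from trl import DPOConfig

# LoRA adapter settings: r and alpha come from the card; dropout and
# target modules are assumptions, not reported values.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# DPO training arguments for a single iteration: the learning rate comes from
# the card; beta, batch size, epochs, and output_dir are placeholders.
dpo_args = DPOConfig(
    output_dir="llama32-1b-iterative-dpo",
    learning_rate=3e-5,
    beta=0.1,
    per_device_train_batch_size=2,
    num_train_epochs=1,
)
```

Both objects would typically be passed to `trl`'s `DPOTrainer` together with the accumulated preference pairs.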
Iterative Training Process
- Iteration 0: Train on the initial LLM Judge preferences (baseline DPO)
- Iteration 1:
  - Model generates new responses
  - Model judges its own responses (self-rewarding)
  - Train on the accumulated preferences
- Iteration 2:
  - Repeat self-judging with the improved model
  - Train on all accumulated preferences (see the loop sketch below)
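The loop above can be summarized in pseudocode as follows. This is only an outline: `dpo_train`, `generate_responses`, and `self_judge` are hypothetical helpers standing in for the actual (unpublished) training scripts, and only the pair counts come from the Training Details section.

```python
def iterative_self_rewarding_dpo(model, prompts, seed_pairs, num_iterations=2):
    """Outline of the training loop; the helper functions are hypothetical."""
    preference_pairs = list(seed_pairs)            # iteration 0: 15 LLM Judge pairs
    model = dpo_train(model, preference_pairs)     # baseline DPO

    for _ in range(num_iterations):
        # The current model generates several candidate responses per prompt ...
        candidates = generate_responses(model, prompts, num_samples=4)

        # ... then scores its own candidates (self-rewarding); the best- and
        # worst-scored responses per prompt become new chosen/rejected pairs.
        new_pairs = self_judge(model, prompts, candidates)

        # Train on all accumulated preferences (25 pairs after iteration 1,
        # 33 pairs after iteration 2).
        preference_pairs.extend(new_pairs)
        model = dpo_train(model, preference_pairs)

    return model
```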
Self-Rewarding Approach
The model progressively refines its own judgment criteria through:
- Self-evaluation of generated responses (a possible judging prompt is sketched below)
- Accumulation of diverse preference data
- Compound learning from multiple iterations
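For concreteness, the self-evaluation step can be implemented with an LLM-as-a-judge prompt in the style of the Self-Rewarding Language Models paper, where the model grades its own answer and a numeric score is parsed from the review. The rubric wording and the `score_own_response` helper below are illustrative assumptions, not the exact prompt used to train this model.

```python
import re

# Illustrative self-judging prompt; the rubric wording is an assumption.
JUDGE_TEMPLATE = """Review the user's question and the candidate response below.
Award points additively (0-5): relevance, completeness, clarity, correctness,
and overall helpfulness. End your review with the line "Score: <points>".

Question: {question}

Response: {response}"""

def score_own_response(model, tokenizer, question, response, max_new_tokens=200):
    """Ask the model to grade its own response and parse the numeric score."""
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": JUDGE_TEMPLATE.format(question=question, response=response)}],
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated review tokens, then extract "Score: N".
    review = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    match = re.search(r"Score:\s*(\d+)", review)
    return int(match.group(1)) if match else None
```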
Performance
Compared to the base DPO model (Zickl/llama32-1b-dpo-llm-judge):
- Improved response quality through self-refinement
- Better alignment with implicit quality standards
- Enhanced consistency in output quality
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "Zickl/llama32-1b-iterative-dpo")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Generate
messages = [{"role": "user", "content": "Your question here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
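If you do not need to keep the adapter separate, the LoRA weights can be folded into the base model for simpler deployment; `merge_and_unload` is a standard `peft` call, and the save path below is only an example.

```python
# Optionally merge the adapter into the base model and save a standalone copy
# (the output directory name is an example, not an official artifact).
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama32-1b-iterative-dpo-merged")
tokenizer.save_pretrained("llama32-1b-iterative-dpo-merged")
```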
Limitations
- Self-judgment may amplify certain biases
- Limited to model's inherent capability ceiling
- Requires careful monitoring for mode collapse
- May over-optimize for specific patterns
Related Models
- Base DPO Model: Zickl/llama32-1b-dpo-llm-judge
- Preference Datasets: Zickl/dpo-preference-datasets
Citation
```bibtex
@misc{llama32-iterative-dpo,
  author    = {Zickl},
  title     = {Llama-3.2-1B Iterative DPO (Self-Rewarding)},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Zickl/llama32-1b-iterative-dpo}
}
```
References
- Yuan et al., "Self-Rewarding Language Models", 2024. arXiv:2401.10020