Llama-3.2-1B Iterative DPO (Self-Rewarding)

This model was trained with Iterative DPO, following the Self-Rewarding Language Models approach, in which the model acts as its own judge and improves itself over multiple training iterations.

Training Details

  • Base Model: meta-llama/Llama-3.2-1B-Instruct
  • Training Method: Iterative DPO with Self-Rewarding
  • Number of Iterations: 2
  • Initial Dataset: 15 LLM Judge preference pairs
  • Iteration 1 Dataset: 25 total pairs (15 initial + 10 self-judged)
  • Iteration 2 Dataset: 33 total pairs (25 + 8 self-judged)
  • LoRA Configuration: r=16, alpha=32
  • Learning Rate: 3e-5 for each iteration (see the configuration sketch below)
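
The listed hyperparameters map onto a standard peft + TRL setup. The sketch below shows how such a run might be configured; only r=16, alpha=32, and the 3e-5 learning rate come from this card, while the dropout, target modules, batch size, beta, epoch count, and dataset path are assumptions, and exact DPOTrainer argument names vary between trl versions.

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA settings from this card; dropout and target modules are assumed
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Preference pairs in the usual prompt/chosen/rejected format (file name is hypothetical)
preference_pairs = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

training_args = DPOConfig(
    output_dir="iterative-dpo-checkpoint",
    learning_rate=3e-5,            # from this card
    beta=0.1,                      # assumed DPO temperature
    per_device_train_batch_size=1,
    num_train_epochs=3,            # assumed
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=preference_pairs,
    processing_class=tokenizer,    # older trl versions take `tokenizer=` instead
    peft_config=lora_config,
)
trainer.train()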

Iterative Training Process

  1. Iteration 0: Train on LLM Judge preferences (baseline DPO)
  2. Iteration 1:
    • Model generates new responses
    • Model judges its own responses (self-rewarding)
    • Train on accumulated preferences
  3. Iteration 2:
    • Repeat self-judging with improved model
    • Train on all accumulated preferences (the full loop is sketched below)
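
In code form, the procedure looks roughly like the sketch below. This is not the released training script: `load_initial_pairs`, `sample_prompts`, `generate_candidates`, `self_judge`, and `run_dpo_step` are hypothetical helpers standing in for the generation, self-judging, and DPO-training stages described above.

# Rough outline of the self-rewarding loop; all helper functions are hypothetical.
preference_pairs = load_initial_pairs()             # 15 LLM-Judge pairs
model = run_dpo_step(base_model, preference_pairs)  # Iteration 0: baseline DPO

for iteration in (1, 2):
    new_pairs = []
    for prompt in sample_prompts():
        # The current model generates several candidate responses...
        candidates = generate_candidates(model, prompt, n=4)
        # ...then scores them itself; best becomes "chosen", worst "rejected".
        chosen, rejected = self_judge(model, prompt, candidates)
        new_pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})

    # Accumulate: 25 pairs after iteration 1, 33 after iteration 2.
    preference_pairs += new_pairs
    model = run_dpo_step(model, preference_pairs)   # retrain on the full set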

Self-Rewarding Approach

The model progressively refines its own judgment criteria through:

  • Self-evaluation of generated responses (an example judging prompt is sketched after this list)
  • Accumulation of diverse preference data
  • Compound learning from multiple iterations
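
The exact judging prompt used for this adapter is not reproduced here. The function below is only a sketch of how LLM-as-a-Judge style self-scoring typically works; the rubric wording, score range, decoding settings, and score parsing are all assumptions.

import re

# Hypothetical rubric-style judging prompt (the real prompt is not published here).
JUDGE_TEMPLATE = (
    "Review the user's question and the candidate response.\n"
    "Rate the response from 0 to 5 for relevance, helpfulness, and clarity.\n"
    "Reply with the score only.\n\n"
    "Question: {prompt}\n\nResponse: {response}\n\nScore:"
)

def self_score(model, tokenizer, prompt, response):
    """Ask the model to rate one of its own responses; returns a float score."""
    judge_prompt = JUDGE_TEMPLATE.format(prompt=prompt, response=response)
    inputs = tokenizer(judge_prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    reply = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else 0.0

# The highest- and lowest-scoring candidates then form a new (chosen, rejected) pair.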

Performance

Compared to the base DPO model:

  • Improved response quality through self-refinement
  • Better alignment with implicit quality standards
  • Enhanced consistency in output quality

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "Zickl/llama32-1b-iterative-dpo")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Generate
messages = [{"role": "user", "content": "Your question here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
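
If you want a standalone checkpoint that does not require peft at inference time, the LoRA weights can be folded into the base model with peft's merge_and_unload; the output directory name below is only an example.

# Optional: merge the LoRA weights into the base model and save a standalone copy
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama32-1b-iterative-dpo-merged")
tokenizer.save_pretrained("llama32-1b-iterative-dpo-merged")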

Limitations

  • Self-judgment may amplify certain biases
  • Limited by the base model's inherent capability ceiling
  • Requires careful monitoring for mode collapse
  • May over-optimize for specific patterns

Citation

@misc{llama32-iterative-dpo,
  author = {Zickl},
  title = {Llama-3.2-1B Iterative DPO (Self-Rewarding)},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Zickl/llama32-1b-iterative-dpo}
}

References

  • Yuan et al., "Self-Rewarding Language Models", arXiv:2401.10020, 2024.
  • Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", arXiv:2305.18290, 2023.
