# DPO Fine-Tuned Adapter - LLM Judge Dataset
## 🧠 Model
- Base: `meta-llama/Llama-3.2-1B-Instruct`
- Fine-tuned using TRL's `DPOTrainer` with the LLM Judge preference dataset (50 pairs); the objective it optimizes is shown below
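
For context, `DPOTrainer` minimizes the standard DPO loss over (`prompt`, `chosen`, `rejected`) triples, where β (listed in the table below) controls how strongly the policy is regularized toward the frozen reference model:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here $y_w$ is the chosen response and $y_l$ the rejected one for prompt $x$.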
## ⚙️ Training Parameters
| Parameter | Value |
|---|---|
| Learning Rate | 5e-5 |
| Batch Size | 4 |
| Epochs | 3 |
| Beta (DPO regularizer) | 0.1 |
| Max Input Length | 1024 tokens |
| Max Prompt Length | 512 tokens |
| Padding Token | `eos_token` |
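
A minimal training sketch using these values, assuming a recent TRL release (with `DPOConfig`) plus PEFT for the LoRA adapter. The LoRA rank, alpha, and dropout below are illustrative assumptions, not values from this card:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # padding token = eos_token, per the table

# Preference pairs with prompt / chosen / rejected columns
dataset = load_dataset("csv", data_files="llm_judge_preferences.csv", split="train")

# LoRA settings are assumptions; the card does not specify them
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

args = DPOConfig(
    output_dir="dpo-llmjudge-lora-adapter",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    beta=0.1,                # DPO regularizer
    max_length=1024,         # max input length in tokens
    max_prompt_length=512,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,  # trains only the LoRA adapter weights
)
trainer.train()
trainer.save_model(args.output_dir)
```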
## 📦 Dataset
- Source: `llm_judge_preferences.csv`
- Size: 50 human-labeled pairs with `prompt`, `chosen`, and `rejected` columns
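
A quick schema check, as a sketch with pandas (assuming the CSV sits in the working directory):

```python
import pandas as pd

df = pd.read_csv("llm_judge_preferences.csv")
assert {"prompt", "chosen", "rejected"} <= set(df.columns)
print(len(df))  # expected: 50 preference pairs
```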
## 🚀 Output
- Adapter saved and uploaded as `Likhith003/dpo-llmjudge-lora-adapter`
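
A usage sketch for loading the adapter on top of the base model with PEFT (the prompt and generation settings are illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)

# Attach the DPO-trained LoRA adapter on top of the frozen base weights
model = PeftModel.from_pretrained(base, "Likhith003/dpo-llmjudge-lora-adapter")
model.eval()

prompt = "Explain what a preference dataset is."  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```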