---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- lora
- peft
- llama-3.2
- pairrm
library_name: peft
---
# DPO Fine-Tune of Llama-3.2-1B using PairRM Preferences
This repository contains the LoRA adapters for a `meta-llama/Llama-3.2-1B-Instruct` model that has been fine-tuned using Direct Preference Optimization (DPO).
The preference dataset for this training was generated using the `llm-blender/PairRM` reward model, which ranks LLM responses by quality. This setup offers an efficient approach to preference alignment that requires neither a separate LLM judge nor human annotation.
- **Preference Dataset:** [NilayR/pairrm-preferences-llama32](https://huggingface.co/datasets/NilayR/pairrm-preferences-llama32)
## Model Details
### Model Description
This model is a fine-tuned version of `meta-llama/Llama-3.2-1B-Instruct`. It was trained using DPO on a preference dataset where the 'chosen' and 'rejected' labels were determined by the `llm-blender/PairRM` model. The goal was to align the base model's outputs with PairRM's learned preferences for high-quality, factual, and concise responses.
- **Developed by:** NilayR
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** apache-2.0
- **Finetuned from model:** `meta-llama/Llama-3.2-1B-Instruct`
## How to Get Started with the Model
To use these LoRA adapters, load the base model (`meta-llama/Llama-3.2-1B-Instruct`) and then apply the adapters from this repository.
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# Set base model ID and adapter path
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "NilayR/llama32-dpo-pairrm"
# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
base_model_id,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token
# Load and apply the PEFT adapters
model = PeftModel.from_pretrained(base_model, adapter_id)
# --- Generate a response ---
prompt = "What are the main differences between renewable and non-renewable energy?"
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
outputs = model.generate(
input_ids,
max_new_tokens=200,
do_sample=True,
temperature=0.7,
top_p=0.95
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("assistant")[-1].strip())
```
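If a standalone checkpoint is more convenient than loading adapters at runtime, the adapters can also be merged into an unquantized copy of the base model. This is a minimal sketch reusing `base_model_id`, `adapter_id`, and `tokenizer` from the snippet above; the output directory name is illustrative.

```python
# Optional: merge the LoRA adapters into a full-precision copy of the base model.
# Merging into a 4-bit quantized model is not recommended, so reload the base
# model in bfloat16 before merging.
merge_base = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged_model = PeftModel.from_pretrained(merge_base, adapter_id).merge_and_unload()

# Illustrative output path for the merged checkpoint.
merged_model.save_pretrained("llama32-dpo-pairrm-merged")
tokenizer.save_pretrained("llama32-dpo-pairrm-merged")
```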
## Training Details
### Training Data
The model was trained on a preference dataset generated using the `llm-blender/PairRM` model.
* **Data Generation Process:**
1. **Instructions:** 50 instructions were extracted from the LIMA dataset.
2. **Response Generation:** The base `meta-llama/Llama-3.2-1B-Instruct` model generated 5 diverse responses for each instruction.
3. **Preference Labeling:** The `llm-blender/PairRM` ranker scored all 5 responses for each instruction. The highest-ranked response was selected as 'chosen' and the lowest-ranked as 'rejected', resulting in **50 preference pairs** (a minimal ranking sketch follows this list).
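For reference, the following is a minimal sketch of how PairRM rankings of this kind can be produced with the `llm-blender` package. The instruction and candidate responses below are illustrative, not the exact script or data used to build the linked dataset.

```python
import llm_blender

# Load the PairRM ranker used to score candidate responses.
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

# One instruction with several candidate responses (illustrative values).
instructions = ["What are the main differences between renewable and non-renewable energy?"]
candidates = [[
    "Renewable energy comes from sources that replenish naturally, such as sunlight and wind...",
    "Energy is either renewable or not.",
    "Non-renewable energy, such as coal and oil, is finite and emits more carbon...",
]]

# rank() returns an array of shape (num_instructions, num_candidates);
# a lower rank means a better response.
ranks = blender.rank(instructions, candidates, batch_size=1)

chosen = candidates[0][ranks[0].argmin()]   # highest-ranked response
rejected = candidates[0][ranks[0].argmax()]  # lowest-ranked response
print({"prompt": instructions[0], "chosen": chosen, "rejected": rejected})
```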
### Training Procedure
The model was trained for one epoch using the TRL library's `DPOTrainer`; a configuration sketch follows the hyperparameter and LoRA lists below.
#### Training Hyperparameters
* **Framework:** `trl.DPOTrainer`
* **Epochs:** 1
* **Batch Size:** 1
* **Gradient Accumulation Steps:** 4 (Effective Batch Size: 4)
* **Optimizer:** `paged_adamw_8bit`
* **Learning Rate:** 5e-5
* **LR Scheduler:** `cosine` with a warmup ratio of 0.1
* **DPO Beta (β):** 0.1 (see the objective below)
* **Final Training Loss:** `0.6872`
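For reference, `DPOTrainer` optimizes the standard DPO objective, in which β scales the implicit reward margin and controls how far the policy may drift from the frozen reference model (a smaller β tolerates larger deviations):

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the 'chosen' and 'rejected' responses for prompt $x$.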
#### LoRA Configuration
* **Rank (`r`):** 16
* **Alpha (`lora_alpha`):** 32
* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
* **Dropout:** 0.05
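Putting the two lists above together, a minimal configuration sketch looks roughly as follows. Argument names follow recent `trl`/`peft` releases and may differ slightly from the exact training script; the preference dataset is assumed to expose `prompt`, `chosen`, and `rejected` columns in a `train` split, and `base_model` / `tokenizer` are the objects loaded in the usage snippet above.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# Preference pairs generated with PairRM (columns: prompt, chosen, rejected).
dataset = load_dataset("NilayR/pairrm-preferences-llama32", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="llama32-dpo-pairrm",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size of 4
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="paged_adamw_8bit",
    beta=0.1,                        # DPO beta
)

trainer = DPOTrainer(
    model=base_model,                # the 4-bit quantized base model from above
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,      # older trl versions use `tokenizer=` instead
    peft_config=peft_config,
)
trainer.train()
```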
### Compute Infrastructure
* **Hardware:** 1x NVIDIA A100 40GB GPU
* **Cloud Provider:** Google Colab
* **Software:** `transformers`, `peft`, `trl`, `bitsandbytes`