---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- lora
- peft
- llama-3.2
- pairrm
library_name: peft
---

# DPO Fine-Tune of Llama-3.2-1B using PairRM Preferences

This repository contains the LoRA adapters for a `meta-llama/Llama-3.2-1B-Instruct` model fine-tuned using Direct Preference Optimization (DPO).

The preference dataset for this training was generated with the `llm-blender/PairRM` reward model, which ranks LLM responses by quality. This pipeline offers an efficient route to preference alignment without a separate LLM judge or human annotation.

- **Preference Dataset:** [NilayR/pairrm-preferences-llama32](https://huggingface.co/datasets/NilayR/pairrm-preferences-llama32)
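
For a quick look at the training pairs, the dataset can be loaded with the `datasets` library. This is a minimal sketch; the split name and the column names (`prompt`, `chosen`, `rejected`) are assumptions based on the standard DPO data format rather than confirmed from the dataset card:

```python
# Minimal sketch: peek at the PairRM preference pairs.
# The split and column names ("prompt", "chosen", "rejected") are assumed
# from the standard DPO format and may differ in the actual dataset.
from datasets import load_dataset

prefs = load_dataset("NilayR/pairrm-preferences-llama32", split="train")
print(prefs)                  # number of rows and column names
print(prefs[0]["chosen"])     # a PairRM-preferred response
print(prefs[0]["rejected"])   # the corresponding lowest-ranked response
```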

## Model Details

### Model Description

This model is a fine-tuned version of `meta-llama/Llama-3.2-1B-Instruct`. It was trained using DPO on a preference dataset where the 'chosen' and 'rejected' labels were determined by the `llm-blender/PairRM` model. The goal was to align the base model's outputs with PairRM's learned preferences for high-quality, factual, and concise responses.

- **Developed by:** NilayR
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** apache-2.0
- **Finetuned from model:** `meta-llama/Llama-3.2-1B-Instruct`

## How to Get Started with the Model

To use these LoRA adapters, load the base model (`meta-llama/Llama-3.2-1B-Instruct`) and then apply the adapters from this repository.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Set base model ID and adapter path
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "NilayR/llama32-dpo-pairrm"

# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load and apply the PEFT adapters
model = PeftModel.from_pretrained(base_model, adapter_id)

# --- Generate a response ---
prompt = "What are the main differences between renewable and non-renewable energy?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("assistant")[-1].strip())
```

## Training Details

### Training Data

The model was trained on a preference dataset generated using the `llm-blender/PairRM` model.

* **Data Generation Process:**
  1. **Instructions:** 50 instructions were extracted from the LIMA dataset.
  2. **Response Generation:** The base `Llama-3.2-1B` model generated 5 diverse responses for each instruction.
  3. **Preference Labeling:** The `llm-blender/PairRM` ranker scored all 5 responses for each instruction. The highest-ranked response was selected as 'chosen' and the lowest-ranked as 'rejected', resulting in **50 preference pairs** (a sketch of this ranking step is shown below).
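
The labeling step in item 3 can be reproduced with the `llm-blender` package. The sketch below follows that package's `Blender.rank` API; the example instruction, candidate texts, and variable names are illustrative, not taken from the actual generation script:

```python
# Illustrative PairRM ranking step (assumes the llm-blender package).
# The instruction and candidate texts below are placeholders.
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load the PairRM ranker weights

instructions = ["Explain the difference between TCP and UDP."]
candidate_responses = [[
    "TCP is connection-oriented and guarantees ordered delivery...",
    "UDP is faster but offers no delivery guarantees...",
    "Both are transport-layer protocols...",
    "TCP retransmits lost packets, while UDP does not...",
    "They differ mainly in reliability and ordering...",
]]

# rank() returns, per instruction, a ranking of the candidates (1 = best)
ranks = blender.rank(instructions, candidate_responses, return_scores=False, batch_size=1)

preference_pairs = []
for cands, rank in zip(candidate_responses, ranks):
    rank = list(rank)
    preference_pairs.append({
        "chosen": cands[rank.index(min(rank))],    # highest-ranked response
        "rejected": cands[rank.index(max(rank))],  # lowest-ranked response
    })
```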

### Training Procedure

The model was trained for one epoch using the TRL library's `DPOTrainer`. Illustrative configuration sketches follow the hyperparameter and LoRA lists below.

#### Training Hyperparameters

* **Framework:** `trl.DPOTrainer`
* **Epochs:** 1
* **Batch Size:** 1
* **Gradient Accumulation Steps:** 4 (Effective Batch Size: 4)
* **Optimizer:** `paged_adamw_8bit`
* **Learning Rate:** 5e-5
* **LR Scheduler:** `cosine` with a warmup ratio of 0.1
* **DPO Beta (β):** 0.1
* **Final Training Loss:** `0.6872`
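
These settings map roughly onto a TRL `DPOConfig`. This is a sketch rather than the exact training script, and it assumes a recent TRL release where DPO-specific options such as `beta` live on `DPOConfig`; the `output_dir` is a hypothetical path:

```python
# Illustrative DPOConfig matching the hyperparameters listed above.
# Assumes a recent TRL release where `beta` is a DPOConfig field;
# "llama32-dpo-pairrm" is a hypothetical output directory.
from trl import DPOConfig

dpo_args = DPOConfig(
    output_dir="llama32-dpo-pairrm",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size of 4
    optim="paged_adamw_8bit",
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,                        # DPO beta
)
```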

#### LoRA Configuration

* **Rank (`r`):** 16
* **Alpha (`lora_alpha`):** 32
* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
* **Dropout:** 0.05
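
Putting the two lists together, the adapter and trainer setup look roughly like the sketch below. `model`, `tokenizer`, `train_dataset`, and `dpo_args` are placeholders for objects defined elsewhere (the quantized base model, its tokenizer, the 50 preference pairs, and the `DPOConfig` sketched above); exact argument names depend on the `trl` and `peft` versions used:

```python
# Illustrative LoRA + DPOTrainer wiring matching the configuration above.
# `model`, `tokenizer`, `train_dataset`, and `dpo_args` are assumed to already exist.
from peft import LoraConfig
from trl import DPOTrainer

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model=model,                  # quantized base model
    args=dpo_args,                # DPOConfig from the sketch above
    train_dataset=train_dataset,  # the 50 PairRM preference pairs
    processing_class=tokenizer,   # `tokenizer=` on older trl releases
    peft_config=lora_config,      # train LoRA adapters instead of full weights
)
trainer.train()
```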

### Compute Infrastructure

* **Hardware:** 1x NVIDIA A100 40GB GPU
* **Cloud Provider:** Google Colab
* **Software:** `transformers`, `peft`, `trl`, `bitsandbytes`

-----