---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- lora
- peft
- llama-3.2
- pairrm
library_name: peft
---

# DPO Fine-Tune of Llama-3.2-1B using PairRM Preferences

This repository contains the LoRA adapters for a `meta-llama/Llama-3.2-1B-Instruct` model fine-tuned using Direct Preference Optimization (DPO).

The preference dataset for this training was generated with the `llm-blender/PairRM` reward model, which ranks LLM responses by quality. This pipeline offers an efficient route to preference alignment without a separate LLM judge or human annotation.

- **Preference Dataset:** [NilayR/pairrm-preferences-llama32](https://huggingface.co/datasets/NilayR/pairrm-preferences-llama32)
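
For a quick look at the training pairs, the dataset can be loaded with the `datasets` library. This is a minimal sketch; the split name and the column names (`prompt`, `chosen`, `rejected`) are assumptions based on the standard DPO data format rather than confirmed from the dataset card:

```python
# Minimal sketch: peek at the PairRM preference pairs.
# The split and column names ("prompt", "chosen", "rejected") are assumed
# from the standard DPO format and may differ in the actual dataset.
from datasets import load_dataset

prefs = load_dataset("NilayR/pairrm-preferences-llama32", split="train")
print(prefs)                  # number of rows and column names
print(prefs[0]["chosen"])     # a PairRM-preferred response
print(prefs[0]["rejected"])   # the corresponding lowest-ranked response
```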

## Model Details

### Model Description

This model is a fine-tuned version of `meta-llama/Llama-3.2-1B-Instruct`. It was trained using DPO on a preference dataset where the 'chosen' and 'rejected' labels were determined by the `llm-blender/PairRM` model. The goal was to align the base model's outputs with PairRM's learned preferences for high-quality, factual, and concise responses.

- **Developed by:** NilayR
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** apache-2.0
- **Finetuned from model:** `meta-llama/Llama-3.2-1B-Instruct`

## How to Get Started with the Model

To use these LoRA adapters, load the base model (`meta-llama/Llama-3.2-1B-Instruct`) and then apply the adapters from this repository.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Set base model ID and adapter path
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "NilayR/llama32-dpo-pairrm"

# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load and apply the PEFT adapters
model = PeftModel.from_pretrained(base_model, adapter_id)

# --- Generate a response ---
prompt = "What are the main differences between renewable and non-renewable energy?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("assistant")[-1].strip())
```

## Training Details

### Training Data

The model was trained on a preference dataset generated using the `llm-blender/PairRM` model.

* **Data Generation Process:**
  1. **Instructions:** 50 instructions were extracted from the LIMA dataset.
  2. **Response Generation:** The base `Llama-3.2-1B` model generated 5 diverse responses for each instruction.
  3. **Preference Labeling:** The `llm-blender/PairRM` ranker scored all 5 responses for each instruction. The highest-ranked response was selected as 'chosen' and the lowest-ranked as 'rejected', resulting in **50 preference pairs** (a sketch of this ranking step is shown below).
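
The labeling step in item 3 can be reproduced with the `llm-blender` package. The sketch below follows that package's `Blender.rank` API; the example instruction, candidate texts, and variable names are illustrative, not taken from the actual generation script:

```python
# Illustrative PairRM ranking step (assumes the llm-blender package).
# The instruction and candidate texts below are placeholders.
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load the PairRM ranker weights

instructions = ["Explain the difference between TCP and UDP."]
candidate_responses = [[
    "TCP is connection-oriented and guarantees ordered delivery...",
    "UDP is faster but offers no delivery guarantees...",
    "Both are transport-layer protocols...",
    "TCP retransmits lost packets, while UDP does not...",
    "They differ mainly in reliability and ordering...",
]]

# rank() returns, per instruction, a ranking of the candidates (1 = best)
ranks = blender.rank(instructions, candidate_responses, return_scores=False, batch_size=1)

preference_pairs = []
for cands, rank in zip(candidate_responses, ranks):
    rank = list(rank)
    preference_pairs.append({
        "chosen": cands[rank.index(min(rank))],    # highest-ranked response
        "rejected": cands[rank.index(max(rank))],  # lowest-ranked response
    })
```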

### Training Procedure

The model was trained for one epoch using the TRL library's `DPOTrainer`. Illustrative configuration sketches follow the hyperparameter and LoRA lists below.

#### Training Hyperparameters

* **Framework:** `trl.DPOTrainer`
* **Epochs:** 1
* **Batch Size:** 1
* **Gradient Accumulation Steps:** 4 (Effective Batch Size: 4)
* **Optimizer:** `paged_adamw_8bit`
* **Learning Rate:** 5e-5
* **LR Scheduler:** `cosine` with a warmup ratio of 0.1
* **DPO Beta (β):** 0.1
* **Final Training Loss:** `0.6872`
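
These settings map roughly onto a TRL `DPOConfig`. This is a sketch rather than the exact training script, and it assumes a recent TRL release where DPO-specific options such as `beta` live on `DPOConfig`; the `output_dir` is a hypothetical path:

```python
# Illustrative DPOConfig matching the hyperparameters listed above.
# Assumes a recent TRL release where `beta` is a DPOConfig field;
# "llama32-dpo-pairrm" is a hypothetical output directory.
from trl import DPOConfig

dpo_args = DPOConfig(
    output_dir="llama32-dpo-pairrm",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size of 4
    optim="paged_adamw_8bit",
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,                        # DPO beta
)
```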

#### LoRA Configuration

* **Rank (`r`):** 16
* **Alpha (`lora_alpha`):** 32
* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
* **Dropout:** 0.05
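
Putting the two lists together, the adapter and trainer setup look roughly like the sketch below. `model`, `tokenizer`, `train_dataset`, and `dpo_args` are placeholders for objects defined elsewhere (the quantized base model, its tokenizer, the 50 preference pairs, and the `DPOConfig` sketched above); exact argument names depend on the `trl` and `peft` versions used:

```python
# Illustrative LoRA + DPOTrainer wiring matching the configuration above.
# `model`, `tokenizer`, `train_dataset`, and `dpo_args` are assumed to already exist.
from peft import LoraConfig
from trl import DPOTrainer

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model=model,                  # quantized base model
    args=dpo_args,                # DPOConfig from the sketch above
    train_dataset=train_dataset,  # the 50 PairRM preference pairs
    processing_class=tokenizer,   # `tokenizer=` on older trl releases
    peft_config=lora_config,      # train LoRA adapters instead of full weights
)
trainer.train()
```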

### Compute Infrastructure

* **Hardware:** 1x NVIDIA A100 40GB GPU
* **Cloud Provider:** Google Colab
* **Software:** `transformers`, `peft`, `trl`, `bitsandbytes`

-----