Update README.md
README.md
CHANGED
@@ -1,10 +1,64 @@

Before:

---
datasets:
- jasperyeoh2/pairrm-preference-dataset
- GAIR/lima
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
tags:

After:

---
license: mit
tags:
- mistral
- dpo
- preference-optimization
- peft
- lora
- instruction-tuning
- alpaca-eval
---

# 🧠 Mistral-7B DPO Fine-Tuned Adapter (PEFT)

This repository hosts a PEFT adapter trained via **Direct Preference Optimization (DPO)** with **LoRA** on top of [`mistralai/Mistral-7B-Instruct-v0.2`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). The preference dataset was generated with **PairRM**, a pairwise reward model whose rankings agree closely with human judgments.

---

## 📦 Model Details

| Attribute            | Value |
|----------------------|------------------------------------------------------------|
| **Base Model**       | [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) |
| **Training Method**  | DPO (Direct Preference Optimization) |
| **Adapter Type**     | PEFT ([LoRA](https://github.com/microsoft/LoRA)) |
| **Preference Model** | [PairRM](https://huggingface.co/llm-blender/PairRM) |
| **Frameworks**       | Hugging Face 🤗 Transformers + TRL + PEFT |
| **Compute**          | 4 × A800 GPUs |
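
The exact training configuration is not published in this repository, but a minimal sketch of the setup the table implies (TRL's `DPOTrainer` with a LoRA `peft_config`) could look like the following. The hyperparameters, LoRA settings, and dataset column names are assumptions, and the `DPOTrainer` call signature follows older TRL releases; newer versions move `beta` and related options into a `DPOConfig`.

```python
# Illustrative sketch only: hyperparameters and LoRA settings are assumptions,
# not the values actually used to train this adapter.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer  # older TRL API; newer releases use DPOConfig

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto"
)

# DPO expects records with "prompt", "chosen", and "rejected" columns.
train_dataset = load_dataset("jasperyeoh2/mistral-dpo-dataset", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model,
    ref_model=None,  # with a peft_config, TRL uses the frozen base weights as the reference
    beta=0.1,
    args=TrainingArguments(
        output_dir="mistral-dpo-peft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("mistral-dpo-peft")
```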

---

## 📊 Dataset

- **Source**: [GAIR/LIMA](https://huggingface.co/datasets/GAIR/lima)
- **Generation Process**:
  - 50 instructions sampled from LIMA
  - Each instruction was completed 5 times using the base model
  - Pairwise preferences generated using [`llm-blender/PairRM`](https://huggingface.co/llm-blender/PairRM) (see the sketch below)
- **Final Format**: DPO-formatted JSONL

🔗 Dataset Repository: [**jasperyeoh2/mistral-dpo-dataset**](https://huggingface.co/datasets/jasperyeoh2/mistral-dpo-dataset)
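
As a rough illustration of that ranking step, the sketch below scores each instruction's candidate completions with PairRM and keeps the best and worst as a `chosen`/`rejected` pair. It assumes the `llm-blender` package and its `Blender.rank` interface; the placeholder data and field names are not the exact schema of the dataset above.

```python
# Hedged sketch: build DPO preference pairs by ranking candidates with PairRM.
# Assumes `pip install llm-blender`; the data below is placeholder, not the real LIMA samples.
import json
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load the PairRM ranking checkpoint

instructions = ["Explain what a LoRA adapter is in two sentences."]
candidate_lists = [[
    "A LoRA adapter adds small low-rank matrices to a frozen model ...",
    "LoRA is a cake recipe ...",  # deliberately weak candidate
    "It is a parameter-efficient fine-tuning method ...",
]]

# rank() returns, for each instruction, a rank per candidate (lower = better).
ranks = blender.rank(instructions, candidate_lists)

with open("dpo_pairs.jsonl", "w") as f:
    for prompt, candidates, cand_ranks in zip(instructions, candidate_lists, ranks):
        cand_ranks = list(cand_ranks)
        record = {
            "prompt": prompt,
            "chosen": candidates[cand_ranks.index(min(cand_ranks))],
            "rejected": candidates[cand_ranks.index(max(cand_ranks))],
        }
        f.write(json.dumps(record) + "\n")
```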

---

## 🧪 Evaluation

- 10 **unseen instructions** from the LIMA test split were used for evaluation
- **Completions from the base model and the DPO model** were compared side by side (sketched below)
- The DPO model demonstrated better **politeness**, **clarity**, and **alignment**
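
A hedged sketch of that side-by-side comparison is shown below. It assumes `model` and `tokenizer` are loaded with the adapter attached as in the Usage section that follows, and that the LIMA test split exposes each prompt as the first entry of a `conversations` field; both details are assumptions, not documented facts about this evaluation.

```python
# Hedged sketch: compare base vs. DPO completions on held-out LIMA prompts.
# Assumes `model` (PeftModel with the DPO adapter) and `tokenizer` from the Usage section below.
from datasets import load_dataset

lima_test = load_dataset("GAIR/lima", split="test").select(range(10))
prompts = [example["conversations"][0] for example in lima_test]  # field name assumed

def complete(prompt):
    messages = [{"role": "user", "content": prompt}]
    ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(ids, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)

for prompt in prompts:
    dpo_answer = complete(prompt)        # adapter active
    with model.disable_adapter():        # temporarily fall back to the base weights
        base_answer = complete(prompt)
    print(f"PROMPT:\n{prompt}\n\nBASE:\n{base_answer}\n\nDPO:\n{dpo_answer}\n{'-' * 40}")
```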

---

## 🚀 Usage (with PEFT)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "jasperyeoh2/mistral-dpo-peft"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
```
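
The snippet ends once the adapter is attached; a short continuation for actually generating text might look like the following (the prompt, chat-template call, and sampling settings are illustrative, not a documented recommendation).

```python
# Continuation of the snippet above: generate with the DPO adapter applied.
messages = [{"role": "user", "content": "Give three tips for writing clear commit messages."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```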