Update README.md
README.md
CHANGED
@@ -1,10 +1,64 @@

Before:

---
datasets:
- jasperyeoh2/pairrm-preference-dataset
- GAIR/lima
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
tags:

After:

---
license: mit
tags:
- mistral
- dpo
- preference-optimization
- peft
- lora
- instruction-tuning
- alpaca-eval
---

# 🧠 Mistral-7B DPO Fine-Tuned Adapter (PEFT)

This repository hosts a PEFT adapter trained via **Direct Preference Optimization (DPO)** with **LoRA** on top of [`mistralai/Mistral-7B-Instruct-v0.2`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). The preference dataset was generated with **PairRM**, a pairwise reward model whose rankings agree closely with human judgments.

---

## 📦 Model Details

| Attribute            | Value |
|----------------------|------------------------------------------------------------|
| **Base Model**       | [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) |
| **Training Method**  | DPO (Direct Preference Optimization) |
| **Adapter Type**     | PEFT ([LoRA](https://github.com/microsoft/LoRA)) |
| **Preference Model** | [PairRM](https://huggingface.co/llm-blender/PairRM) |
| **Frameworks**       | Hugging Face 🤗 Transformers + TRL + PEFT |
| **Compute**          | 4 × A800 GPUs |
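
The exact training configuration is not published in this repository, but a minimal sketch of the setup the table implies (TRL's `DPOTrainer` with a LoRA `peft_config`) could look like the following. The hyperparameters, LoRA settings, and dataset column names are assumptions, and the `DPOTrainer` call signature follows older TRL releases; newer versions move `beta` and related options into a `DPOConfig`.

```python
# Illustrative sketch only: hyperparameters and LoRA settings are assumptions,
# not the values actually used to train this adapter.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer  # older TRL API; newer releases use DPOConfig

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto"
)

# DPO expects records with "prompt", "chosen", and "rejected" columns.
train_dataset = load_dataset("jasperyeoh2/mistral-dpo-dataset", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model,
    ref_model=None,  # with a peft_config, TRL uses the frozen base weights as the reference
    beta=0.1,
    args=TrainingArguments(
        output_dir="mistral-dpo-peft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("mistral-dpo-peft")
```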

---

## 📊 Dataset

- **Source**: [GAIR/LIMA](https://huggingface.co/datasets/GAIR/lima)
- **Generation Process**:
  - 50 instructions sampled from LIMA
  - Each instruction was completed 5 times using the base model
  - Pairwise preferences generated using [`llm-blender/PairRM`](https://huggingface.co/llm-blender/PairRM) (see the sketch below)
- **Final Format**: DPO-formatted JSONL

🔗 Dataset Repository: [**jasperyeoh2/mistral-dpo-dataset**](https://huggingface.co/datasets/jasperyeoh2/mistral-dpo-dataset)
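
As a rough illustration of that ranking step, the sketch below scores each instruction's candidate completions with PairRM and keeps the best and worst as a `chosen`/`rejected` pair. It assumes the `llm-blender` package and its `Blender.rank` interface; the placeholder data and field names are not the exact schema of the dataset above.

```python
# Hedged sketch: build DPO preference pairs by ranking candidates with PairRM.
# Assumes `pip install llm-blender`; the data below is placeholder, not the real LIMA samples.
import json
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load the PairRM ranking checkpoint

instructions = ["Explain what a LoRA adapter is in two sentences."]
candidate_lists = [[
    "A LoRA adapter adds small low-rank matrices to a frozen model ...",
    "LoRA is a cake recipe ...",  # deliberately weak candidate
    "It is a parameter-efficient fine-tuning method ...",
]]

# rank() returns, for each instruction, a rank per candidate (lower = better).
ranks = blender.rank(instructions, candidate_lists)

with open("dpo_pairs.jsonl", "w") as f:
    for prompt, candidates, cand_ranks in zip(instructions, candidate_lists, ranks):
        cand_ranks = list(cand_ranks)
        record = {
            "prompt": prompt,
            "chosen": candidates[cand_ranks.index(min(cand_ranks))],
            "rejected": candidates[cand_ranks.index(max(cand_ranks))],
        }
        f.write(json.dumps(record) + "\n")
```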

---

## 🧪 Evaluation

- 10 **unseen instructions** from the LIMA test split were used for evaluation
- **Completions from the base model and the DPO model** were compared side by side (sketched below)
- The DPO model demonstrated better **politeness**, **clarity**, and **alignment**
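
A hedged sketch of that side-by-side comparison is shown below. It assumes `model` and `tokenizer` are loaded with the adapter attached as in the Usage section that follows, and that the LIMA test split exposes each prompt as the first entry of a `conversations` field; both details are assumptions, not documented facts about this evaluation.

```python
# Hedged sketch: compare base vs. DPO completions on held-out LIMA prompts.
# Assumes `model` (PeftModel with the DPO adapter) and `tokenizer` from the Usage section below.
from datasets import load_dataset

lima_test = load_dataset("GAIR/lima", split="test").select(range(10))
prompts = [example["conversations"][0] for example in lima_test]  # field name assumed

def complete(prompt):
    messages = [{"role": "user", "content": prompt}]
    ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(ids, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)

for prompt in prompts:
    dpo_answer = complete(prompt)        # adapter active
    with model.disable_adapter():        # temporarily fall back to the base weights
        base_answer = complete(prompt)
    print(f"PROMPT:\n{prompt}\n\nBASE:\n{base_answer}\n\nDPO:\n{dpo_answer}\n{'-' * 40}")
```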

---

## 🚀 Usage (with PEFT)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "jasperyeoh2/mistral-dpo-peft"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
```
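
The snippet ends once the adapter is attached; a short continuation for actually generating text might look like the following (the prompt, chat-template call, and sampling settings are illustrative, not a documented recommendation).

```python
# Continuation of the snippet above: generate with the DPO adapter applied.
messages = [{"role": "user", "content": "Give three tips for writing clear commit messages."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```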