jasperyeoh2 committed (verified)
Commit 04e621e
1 Parent(s): cae8908

Update README.md

Files changed (1)
1. README.md +62 -8
README.md CHANGED
@@ -1,10 +1,64 @@
  ---
- datasets:
- - jasperyeoh2/pairrm-preference-dataset
- - GAIR/lima
- base_model:
- - mistralai/Mistral-7B-Instruct-v0.2
  tags:
- - PEFT
- - DPO
- ---

---
license: mit
tags:
- mistral
- dpo
- preference-optimization
- peft
- lora
- instruction-tuning
- alpaca-eval
---

# 🧠 Mistral-7B DPO Fine-Tuned Adapter (PEFT)

This repository hosts a PEFT adapter trained via **Direct Preference Optimization (DPO)** using **LoRA** on top of [`mistralai/Mistral-7B-Instruct-v0.2`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). The preference dataset was generated with **PairRM**, a reward model that ranks candidate responses in close agreement with human preferences.

---

## 📦 Model Details

| Attribute | Value |
|----------------------|------------------------------------------------------------|
| **Base Model** | [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) |
| **Training Method** | DPO (Direct Preference Optimization) |
| **Adapter Type** | PEFT - [LoRA](https://github.com/microsoft/LoRA) |
| **Preference Model** | [PairRM](https://huggingface.co/llm-blender/PairRM) |
| **Frameworks** | HuggingFace 🤗 Transformers + TRL + PEFT |
| **Compute** | 4 × A800 GPUs |

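
For readers who want to reproduce this kind of setup, the sketch below shows a DPO + LoRA run with TRL and PEFT. It is a minimal illustration rather than the exact training script: the hyperparameters are placeholders, the dataset is assumed to use the standard `prompt` / `chosen` / `rejected` DPO columns, and the `DPOTrainer` argument names vary slightly across TRL versions (older releases take `tokenizer=` instead of `processing_class=`).

```python
# Minimal DPO + LoRA sketch with TRL/PEFT; hyperparameters are illustrative only.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# DPO-formatted preference pairs (assumed columns: prompt / chosen / rejected)
train_ds = load_dataset("jasperyeoh2/mistral-dpo-dataset", split="train")

# LoRA adapter config; rank/alpha/dropout are placeholder values
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

args = DPOConfig(
    output_dir="mistral-dpo-peft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    beta=0.1,  # strength of the implicit KL penalty against the reference model
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    processing_class=tokenizer,  # `tokenizer=` on older TRL releases
    peft_config=peft_config,     # TRL wraps the base model with the LoRA adapter
)
trainer.train()
```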

---

## 📚 Dataset

- **Source**: [GAIR/LIMA](https://huggingface.co/datasets/GAIR/lima)
- **Generation Process**:
  - 50 instructions sampled from LIMA
  - Each instruction was completed 5 times using the base model
  - Pairwise preferences generated using [`llm-blender/PairRM`](https://huggingface.co/llm-blender/PairRM) (see the sketch after this section)
- **Final Format**: DPO-formatted JSONL

📁 Dataset Repository: [**jasperyeoh2/mistral-dpo-dataset**](https://huggingface.co/datasets/jasperyeoh2/mistral-dpo-dataset)

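
As a rough illustration of the generation process, the sketch below ranks sampled completions with PairRM and writes the best/worst pair per prompt as DPO-style JSONL. It assumes the `llm-blender` package's `Blender` API and `prompt` / `chosen` / `rejected` field names; the actual script and field names used for the published dataset may differ.

```python
# Sketch: turn several sampled completions per prompt into DPO preference pairs.
import json
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # pairwise reward model used for ranking

prompts = ["Explain what LoRA is in one paragraph."]            # e.g. 50 LIMA instructions
candidates = [["completion 1", "completion 2", "completion 3",
               "completion 4", "completion 5"]]                  # 5 samples per prompt

# rank() returns, for each prompt, one rank per candidate (1 = most preferred)
ranks = blender.rank(prompts, candidates, return_scores=False, batch_size=8)

with open("dpo_pairs.jsonl", "w") as f:
    for prompt, cands, r in zip(prompts, candidates, ranks):
        r = list(r)
        best = cands[r.index(min(r))]   # highest-ranked completion -> "chosen"
        worst = cands[r.index(max(r))]  # lowest-ranked completion  -> "rejected"
        f.write(json.dumps({"prompt": prompt, "chosen": best, "rejected": worst}) + "\n")
```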

---

## 🧪 Evaluation

- 10 **unseen instructions** from the LIMA test split were used for evaluation
- **Completions from the base vs. the DPO model** were compared side-by-side (see the comparison sketch under Usage below)
- The DPO model demonstrated better **politeness**, **clarity**, and **alignment**

---

## 🚀 Usage (with PEFT)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "jasperyeoh2/mistral-dpo-peft"

tokenizer = AutoTokenizer.from_pretrained(base)
# Load the base model in fp16 and attach the DPO-trained LoRA adapter
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
```
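
A short follow-up sketch of how the loaded adapter can be exercised: it formats a prompt with the tokenizer's chat template, generates a completion, and uses PEFT's `disable_adapter()` context manager to produce the corresponding base-model completion for the kind of side-by-side comparison described in the evaluation above. The prompt and generation settings are illustrative.

```python
prompt = "Give me three tips for writing a clear commit message."
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Completion from the DPO-tuned adapter
dpo_out = model.generate(inputs, max_new_tokens=256, do_sample=False)
print("DPO :", tokenizer.decode(dpo_out[0][inputs.shape[-1]:], skip_special_tokens=True))

# Completion from the untouched base model, for a side-by-side comparison
with model.disable_adapter():
    base_out = model.generate(inputs, max_new_tokens=256, do_sample=False)
print("Base:", tokenizer.decode(base_out[0][inputs.shape[-1]:], skip_special_tokens=True))
```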