---
license: mit
datasets:
- openai/summarize_from_feedback
- openai/webgpt_comparisons
- Dahoas/instruct-synthetic-prompt-responses
- Anthropic/hh-rlhf
- lmsys/chatbot_arena_conversations
- openbmb/UltraFeedback
metrics:
- accuracy
tags:
- pair-ranker
- pair_ranker
- reward_model
- reward-model
- pairrm
- pair-rm
- RLHF
language:
- en
---

Inspired by the [DeBERTa Reward Model Series](https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2),
`llm-blender/PairRM` is a PairRanker variant fine-tuned specifically as a reward model, built on deberta-v3-large.

- Github: [https://github.com/yuchenlin/LLM-Blender](https://github.com/yuchenlin/LLM-Blender)
- Paper: [https://arxiv.org/abs/2306.02561](https://arxiv.org/abs/2306.02561)

## Statistics

### Context length
| PairRanker type | Source max length | Candidate max length | Total max length |
|:-----------------:|:-----------------:|:--------------------:|:----------------:|
| [pair-ranker](https://huggingface.co/llm-blender/pair-ranker) | 128 | 128 | 384 |
| [PairRM](https://huggingface.co/llm-blender/pair-reward-model/) (this model) | 1224 | 412 | 2048 |

### Performance

## Usage Example

### Installation
Since PairRanker contains some custom layers and tokens, we recommend using PairRM through our llm-blender code API.
- First install `llm-blender`:
```bash
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```

- Then load PairRM with the following code:
```python
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load PairRM
```

### Use case 1: Compare responses (Quality Evaluator)

- You can rank candidate responses with the following function:

```python
inputs = ["input1", "input2"]
candidates_texts = [["candidate1 for input1", "candidate2 for input1"], ["candidate1 for input2", "candidate2 for input2"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=2)
# ranks is a list of lists, where ranks[i][j] is the rank of candidate j for input i
```
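
If you need raw scores rather than ranks, the call above already exposes a `return_scores` flag; the variant below is a sketch of the assumed behavior when it is set to `True`, not documented API:

```python
# Assumed variant: return_scores=True should yield per-candidate scores
# (higher = preferred) instead of integer ranks.
scores = blender.rank(inputs, candidates_texts, return_scores=True, batch_size=2)
# scores[i][j] would then be the score of candidate j for input i
```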

- Directly compare two candidate responses:
```python
candidates_A = [cands[0] for cands in candidates_texts]
candidates_B = [cands[1] for cands in candidates_texts]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes
# whether candidates_A[i] is better than candidates_B[i] for inputs[i]
```
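
As a quick illustration, the boolean results can be aggregated, for example into a win rate for the A-side candidates (a hypothetical follow-up, not part of the library API):

```python
# Fraction of inputs where candidate A beat candidate B
win_rate_A = sum(comparison_results) / len(comparison_results)
print(f"Candidate A win rate: {win_rate_A:.2%}")
```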

- Directly compare two multi-turn conversations, given that the user's query in each turn is fixed and only the responses differ:
```python
conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "<assistant response>",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "<assistant response>",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether
# all the responses in conv1 together are better than those in conv2
```

### Use case 2: Best-of-n sampling (Decoding Enhancement)
**Best-of-n sampling**, a.k.a. rejection sampling, is a strategy to enhance response quality by selecting the response that is ranked highest by the reward model (learn more at [OpenAI WebGPT section 3.2](https://arxiv.org/pdf/2112.09332.pdf) and the [OpenAI Blog](https://openai.com/research/measuring-goodharts-law)).

Best-of-n sampling is an easy way to improve your LLM's outputs with just a few lines of code. An example of applying it to zephyr follows.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")

inputs = [...] # your list of inputs
system_message = {
    "role": "system",
    "content": "You are a friendly chatbot who always responds in the style of a pirate",
}
messages = [
    [
        system_message,
        {"role": "user", "content": _input},
    ]
    for _input in inputs
]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:")
print(prompts[0])
print("### best-of-n generations:")
print(outputs[0])
```

### Use case 3: RLHF
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and exhibits strong correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4. (See the detailed comparison in 🤗 [PairRM](https://huggingface.co/llm-blender/PairRM).)
Through the `blender.compare()` function, you can easily apply PairRM to popular RLHF toolkits such as [trl](https://huggingface.co/docs/trl/index), for example by using it to label preference pairs, as sketched below.

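The following is a minimal sketch of that idea, not an official recipe: it uses `blender.compare()` (shown above) to decide which of two sampled responses is preferred, and arranges the result into the `prompt`/`chosen`/`rejected` format that trl's `DPOTrainer` expects. The prompts and responses here are hypothetical placeholders.

```python
# Sketch: build a DPO-style preference dataset with PairRM as the judge.
prompts = ["What is the capital of France?"]        # your prompts
responses_a = ["The capital of France is Paris."]   # e.g. from one sampling run
responses_b = ["I believe it is Lyon."]             # e.g. from another sampling run

# For each prompt, True means the A-side response is preferred.
a_is_better = blender.compare(prompts, responses_a, responses_b)

preference_data = [
    {
        "prompt": p,
        "chosen": a if better else b,
        "rejected": b if better else a,
    }
    for p, a, b, better in zip(prompts, responses_a, responses_b, a_is_better)
]
# preference_data can then be wrapped in a datasets.Dataset and passed to
# trl's DPOTrainer as its training set.
```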

**🔥 Check more details in our example Jupyter notebook: [`blender_usage.ipynb`](https://github.com/yuchenlin/LLM-Blender/blob/main/blender_usage.ipynb)**

Learn more in the LLM-Blender Github [README.md](https://github.com/yuchenlin/LLM-Blender#rank-and-fusion).

## Citation
If you are using PairRM in your research, please cite LLM-Blender.
```bibtex
@inproceedings{llm-blender-2023,
    title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
    author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
    year = "2023"
}
```