license: mit
datasets:
- openai/summarize_from_feedback
- openai/webgpt_comparisons
- Dahoas/instruct-synthetic-prompt-responses
- Anthropic/hh-rlhf
- lmsys/chatbot_arena_conversations
- openbmb/UltraFeedback
metrics:
- accuracy
tags:
- pair-ranker
- pair_ranker
- reward_model
- reward-model
- pairrm
- pair-rm
- RLHF
language:
- en
Inspired by the DeBERTa Reward Model Series, llm-blender/PairRM is the PairRanker model finetuned specifically as a reward model on top of deberta-v3-large.
- Github: https://github.com/yuchenlin/LLM-Blender
- Paper: https://arxiv.org/abs/2306.02561
- Space Demo: https://huggingface.co/spaces/llm-blender/LLM-Blender
Usage Example
Installation
Since PairRanker contains some custom layers and tokens, we recommend using PairRM through our llm-blender code API.
- First, install llm-blender:
pip install git+https://github.com/yuchenlin/LLM-Blender.git
- Then load PairRM with the following code:
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load PairRM
Use case 1: Compare responses (Quality Evaluator)
- Then you can rank candidate responses with the following function
inputs = ["input1", "input2"]
candidates_texts = [["candidate1 for input1", "candidate2 for input1"], ["candidate1 for input2", "candidate2 for input2"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=2)
# ranks is a list of ranks where ranks[i][j] is the rank of candidate-j for input-i
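For example, to pick the top-ranked candidate for each input, a small sketch (assuming that, as is standard, rank 1 means the best candidate):
import numpy as np
ranks = np.asarray(ranks)  # shape (num_inputs, num_candidates)
# keep the candidate with the smallest (best) rank for each input
best_candidates = [cands[int(np.argmin(ranks[i]))] for i, cands in enumerate(candidates_texts)]
print(best_candidates)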
- Directly compare two candidate responses
candidates_A = [cands[0] for cands in candidates_texts]
candidates_B = [cands[1] for cands in candidates_texts]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where element[i] denotes whether candidates_A[i] is better than candidates_B[i] for inputs[i]
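For instance, you can then keep whichever response PairRM prefers for each input:
# select the preferred response per input using the boolean comparison results
better_responses = [
    a if a_better else b
    for a, b, a_better in zip(candidates_A, candidates_B, comparison_results)
]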
- Directly compare two multi-turn conversations, given that the user's query in each turn is fixed and only the responses differ.
conv1 = [
{
"content": "hello",
"role": "USER"
},
{
"content": "<assistant response>",
"role": "ASSISTANT"
},
...
]
conv2 = [
{
"content": "hello",
"role": "USER"
},
{
"content": "<assistant response>",
"role": "ASSISTANT"
},
...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether the responses in conv1 are, taken together, better than those in conv2
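As a concrete, illustrative sketch (the conversation contents here are made up; only the format matters):
conv1 = [
    {"content": "What is the capital of France?", "role": "USER"},
    {"content": "The capital of France is Paris.", "role": "ASSISTANT"},
]
conv2 = [
    {"content": "What is the capital of France?", "role": "USER"},
    {"content": "France is a country in western Europe.", "role": "ASSISTANT"},
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# we would expect conv1 to be preferred here since its response answers the question,
# though the actual output depends on the model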
Use case 2: Best-of-n sampling (Decoding Enhancing)
Best-of-n sampling, a.k.a. rejection sampling, is a strategy to enhance response quality by selecting the response that is ranked highest by the reward model (learn more in OpenAI WebGPT section 3.2 and the OpenAI Blog).
Best-of-n sampling is an easy way to improve your LLM's outputs with just a few lines of code. An example of applying it to zephyr follows.
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
inputs = [...] # your list of inputs
system_message = {
"role": "system",
"content": "You are a friendly chatbot who always responds in the style of a pirate",
}
messages = [
[
system_message,
{"role": "user", "content": _input},
]
for _input in inputs
]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:")
print(prompts[0])
print("### best-of-n generations:")
print(outputs[0])
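Conceptually, best_of_n_generate samples n candidates per prompt and keeps the one PairRM ranks highest. A rough manual equivalent (a sketch only, not the library's implementation; the sampling parameters are illustrative):
import torch

n = 10
best_outputs = []
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        gen = model.generate(
            input_ids,
            do_sample=True,
            top_p=0.9,
            max_new_tokens=256,
            num_return_sequences=n,
            pad_token_id=tokenizer.eos_token_id,
        )
    # decode only the sampled continuations, not the prompt
    candidates = tokenizer.batch_decode(gen[:, input_ids.shape[1]:], skip_special_tokens=True)
    # rank the n candidates for this prompt with PairRM and keep the best (rank 1)
    ranks = blender.rank([prompt], [candidates])
    best_idx = min(range(n), key=lambda j: ranks[0][j])
    best_outputs.append(candidates[best_idx])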
Use case 3: RLHF
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and exhibits strong correlation with human preferences despite its extremely small model size (0.4B), approaching the performance of GPT-4.
We believe PairRM will power the alignment of LLMs in an efficient and effective way.
With the blender.compare() function, you can easily apply PairRM to popular RLHF toolkits like trl.
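For instance, one common pattern is to turn two sampled responses per prompt into chosen/rejected preference pairs for trl's DPOTrainer. A hedged sketch (the "prompt"/"chosen"/"rejected" fields follow trl's DPO data convention; how you sample the responses is up to you):
prompts = [...]        # your prompts
responses_A = [...]    # first sampled response per prompt
responses_B = [...]    # second sampled response per prompt

# PairRM decides which response is preferred for each prompt
a_is_better = blender.compare(prompts, responses_A, responses_B)
preference_data = [
    {"prompt": p, "chosen": a if better else b, "rejected": b if better else a}
    for p, a, b, better in zip(prompts, responses_A, responses_B, a_is_better)
]
# preference_data can then be wrapped in a datasets.Dataset and passed to trl's DPOTrainer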
🔥 Check more details in our example Jupyter notebook: blender_usage.ipynb
Learn more in our LLM-Blender Github README.md
Statistics
Context length
PairRanker type | Source max length | Candidate max length | Total max length |
---|---|---|---|
pair-ranker | 128 | 128 | 384 |
PairRM (This model) | 1224 | 412 | 2048 |
Performance
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and exhibits strong correlation with human preferences despite its extremely small model size (0.4B), approaching the performance of GPT-4.
We test pairwise comparison performance on the Auto-J pairwise test data, HHH-Alignment, and MT-Bench human judgments, reported below.
Auto-J Pairwise test data performance
Model | Summ | Exam | Code | Rewriting | Crea W | Func W | Comm | NLP | Overall |
---|---|---|---|---|---|---|---|---|---|
Closed-source Models | | | | | | | | | |
ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
Open-source Models | | | | | | | | | |
SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
LLaMA-2-Chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
LLaMA-2-Chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
AUTO-J (13b) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | 58 | 57.6 | 54.8 |
PairRM (0.4b) | 56.94 | 52.78 | 58.33 | 55.83 | 61.57 | 59.17 | 57.64 | 62.5 | 59.05 |
HHH-Alignment and MT-bench human judgements
Evaluator LM | HHH Help. | HHH Harm. | HHH Hon. | HHH Other | HHH Total Avg. | MT-Bench Human Judg. |
---|---|---|---|---|---|---|
RANDOM | 50 | 50 | 50 | 50 | 50 | 34.26 |
STANFORDNLP REWARD MODEL | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
ALMOST REWARD MODEL | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
LLAMA2-CHAT 70B | 66.1 | 89.66 | 67.21 | 74.42 | 74.21 | 53.67 |
LLAMA2-CHAT 13B+COARSE. | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
PairRM (0.4b) | 84.75 | 84.48 | 80.33 | 90.7 | 84.62 | 59 |
GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
While PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance! We attribute this to two factors:
- PairRM's model architecture is specifically designed for pairwise comparison through bidirectional attention (see the LLM-Blender paper for more details).
- The high-quality, large-scale human preference annotation data it was trained on (see the training dataset list on this Hugging Face page).
Citation
If you are using PairRM in your research, please cite LLM-blender.
@inproceedings{llm-blender-2023,
title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
year = "2023"
}