---
license: apache-2.0
library_name: transformers
tags:
  - storm
  - mistral
  - openchat
  - RLAIF
  - reward model
---

# Storm-7B

## Introduction

We released Storm-7B, the first open-source language model comparable to the GPT-4 series on the AlpacaEval 2.0 leaderboard.

Recent studies show that DPO (Direct Preference Optimization) benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can come at the cost of increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO), which penalizes response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity.
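
To make the idea concrete, the sketch below shows one way a length penalty can enter the standard DPO objective. It is a minimal illustration under our own assumptions, not the exact iLR-DPO implementation: the function name, the hyperparameters `beta` and `alpha`, and the choice to apply the penalty as a margin on the length difference between chosen and rejected responses are hypothetical.

```python
import torch.nn.functional as F

def length_regularized_dpo_loss(
    policy_chosen_logps,    # log p(chosen | x) under the policy, shape (batch,)
    policy_rejected_logps,  # log p(rejected | x) under the policy
    ref_chosen_logps,       # same quantities under the frozen reference model
    ref_rejected_logps,
    chosen_lengths,         # response lengths in tokens, shape (batch,)
    rejected_lengths,
    beta=0.1,               # hypothetical DPO temperature
    alpha=0.01,             # hypothetical length-penalty coefficient
):
    # Standard DPO implicit rewards: scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Length regularization: shrink the preference margin by the length gap, so the
    # policy cannot improve the margin simply by making chosen responses longer.
    length_margin = alpha * (chosen_lengths - rejected_lengths).float()
    logits = chosen_rewards - rejected_rewards - length_margin
    return -F.logsigmoid(logits).mean()
```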

A snapshot of the AlpacaEval 2.0 leaderboard (Single Model, 2024/6/18) is listed below:

| Model                     | LC Win Rate | Win Rate |
|---------------------------|-------------|----------|
| GPT-4 Turbo (04/09)       | 55.0%       | 46.1%    |
| GPT-4 Preview (11/06)     | 50.0%       | 50.0%    |
| Storm-7B                  | 48.9%       | 52.5%    |
| Nanbeige Plus Chat v0.1   | 44.5%       | 56.7%    |
| Qwen1.5 110B Chat         | 43.9%       | 33.8%    |
| Aligner 2B+Claude 3 Opus  | 41.8%       | 34.5%    |
| Claude 3 Opus (02/29)     | 40.5%       | 29.1%    |
| GPT-4                     | 38.1%       | 23.6%    |
| openchat-3.5-0106         | 15.4%       | 10.1%    |

Please refer to the leaderboard webpage for up-to-date results.

We also conducted preliminary evaluations on other benchmarks and observed no significant degradation.

| Model             | ARC   | HellaSwag | MMLU  | TruthfulQA | Winogrande | Avg.  |
|-------------------|-------|-----------|-------|------------|------------|-------|
| Storm-7B          | 67.58 | 80.97     | 62.21 | 57.24      | 80.51      | 69.70 |
| openchat-3.5-0106 | 66.38 | 83.00     | 63.47 | 52.55      | 81.06      | 69.29 |
| internlm2-7b      | 58.02 | 81.24     | 65.24 | 48.73      | 83.82      | 67.41 |
| gemma-7B          | 61.09 | 82.20     | 64.56 | 44.79      | 79.01      | 66.33 |
| Yi-9B             | 61.18 | 78.82     | 70.06 | 42.45      | 77.51      | 66.00 |
| Meta-Llama-3-8B   | 59.47 | 82.09     | 66.69 | 43.90      | 77.35      | 65.90 |
| Mistral-7B-v0.1   | 59.98 | 83.31     | 64.16 | 42.15      | 78.37      | 65.59 |
| Qwen-7b           | 51.37 | 78.47     | 59.84 | 47.79      | 72.69      | 62.03 |

## Uses

Our model uses the same chat template as Openchat-3.5-0106. A sample code snippet for inference using our model is provided below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

model = AutoModelForCausalLM.from_pretrained("jieliu/Storm-7B").to(device)
tokenizer = AutoTokenizer.from_pretrained("jieliu/Storm-7B")
model.eval().requires_grad_(False)

def generate_response(prompt):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    outputs = model.generate(
        input_ids,
        max_length=2048,
        do_sample=True,
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens so the prompt is not echoed back.
    response_ids = outputs[0][input_ids.shape[-1]:]
    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
    return response_text

# Openchat-3.5-0106 chat format: user turn, end-of-turn marker, assistant turn.
prompt = "How does a telescope work?"
input_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(input_prompt)
print("Response:", response_text)
```

## Scripts

You can reproduce our results on AlpacaEval 2.0 using the script provided below.

```bash
git clone https://github.com/tatsu-lab/alpaca_eval.git
cd alpaca_eval
pip install -e .
export OPENAI_API_KEY=<your_api_key>
alpaca_eval evaluate_from_model --model_configs 'Storm-7B'
```
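
The command above uses alpaca_eval's default annotator. To pin the AlpacaEval 2.0 judge explicitly, an annotator config can be passed as well (e.g. `--annotators_config 'weighted_alpaca_eval_gpt4_turbo'`); the flag and config name are taken from the alpaca_eval repository rather than from this model card, so consult its documentation for the current defaults.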

## Limitations

Storm-7B is a quick demonstration that a language model, fine-tuned with AI feedback, can easily surpass or match state-of-the-art models, as assessed by the same AI feedback. However, this improvement on the automatic leaderboard may not necessarily indicate better alignment with human intentions. Our model therefore represents a critical, preliminary reevaluation of the RLAIF paradigm, questioning how much learning from and being evaluated by AI feedback aligns with actual human preferences.

## Citation

```bibtex
@misc{liu2024storm,
    title = {Storm-7B: An Empirical Study of Iterative Direct Preference Optimization},
    url = {},
    author = {Jie Liu and Zhanhui Zhou and Chao Yang and Han-Sen Zhong and Wanli Ouyang},
    month = {April},
    year = {2024}
}
```