	---
	license: apache-2.0
	library_name: transformers
	tags:
	- storm
	- mistral
	- openchat
	- RLAIF
	- reward model
	---

	# Storm-7B
	- Developed by: [Jie Liu](https://jieliu.site/) \\(^{1,2}\\), [Zhanhui Zhou](https://scholar.google.com/citations?user=SbACfYQAAAAJ&hl=zh-CN) \\(^{2}\\), [Jiaheng Liu](https://liujiaheng.github.io/) \\(^{2}\\), [Xingyuan Bu](https://scholar.google.com.hk/citations?user=cqYaRhUAAAAJ&hl=zh-CN) \\(^{2}\\), [Chao Yang](https://scholar.google.com/citations?user=5KRbHPMAAAAJ&hl=zh-CN) \\(^{2}\\), [Han-Sen Zhong](https://scholar.google.com.hk/citations?user=X_ZfX8sAAAAJ&hl=zh-CN) \\(^{\dag 2}\\), [Wanli Ouyang](https://wlouyang.github.io/) \\(^{1,2}\\).
	- \\(^{1}\\)MMLab, The Chinese University of Hong Kong &ensp; \\(^{2}\\)Shanghai AI Laboratory

	## Introduction

	We released Storm-7B, the first open-source language model comparable to the GPT-4 series on the [AlpacaEval 2.0](https://tatsu-lab.github.io/alpaca_eval/) leaderboard.

	Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO), which penalizes response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity.
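
The effect of a length penalty can be illustrated with a minimal sketch. The exact iLR-DPO objective is not given in this README, so the code below assumes one common length-regularized form, in which a margin term `alpha * (len_w - len_l)` is added inside the DPO sigmoid. The function name, the sign convention, and the `beta` and `alpha` values are illustrative assumptions, not the authors' settings.

```python
import math


def lr_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                len_w, len_l, beta=0.1, alpha=0.01):
    """Per-pair length-regularized DPO loss (illustrative sketch).

    logp_* / ref_logp_*: summed log-probabilities of the chosen (w) and
    rejected (l) responses under the policy and the frozen reference model.
    len_*: response lengths in tokens.
    beta, alpha: hypothetical hyperparameter values.
    """
    # Standard DPO margin: difference of the implicit rewards.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Assumed length term: a longer chosen response already gets credit here,
    # so the policy receives a weaker push toward it.
    margin += alpha * (len_w - len_l)
    # -log(sigmoid(margin)), computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Same log-probabilities, different length gaps.
equal_len = lr_dpo_loss(-40.0, -42.0, -41.0, -41.0, len_w=100, len_l=100)
longer_win = lr_dpo_loss(-40.0, -42.0, -41.0, -41.0, len_w=300, len_l=100)
print(f"equal lengths: {equal_len:.4f}, longer chosen response: {longer_win:.4f}")
```

Under this sign convention, a longer chosen response yields a smaller loss and therefore a weaker update, which is one way a length term can keep the policy from drifting toward verbosity.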

	A snapshot of the AlpacaEval 2.0 leaderboard (Single Model, 2024/6/18) is listed below:

	| | LC Win Rate | Win Rate |
	| :----------------------: | :-------------: | :----------: |
	| GPT-4 Turbo (04/09) | 55.0% | 46.1% |
	| GPT-4 Preview (11/06) | 50.0% | 50.0% |
	| Storm-7B | 48.9% | 52.5% |
	| Nanbeige Plus Chat v0.1 | 44.5% | 56.7% |
	| Qwen1.5 110B Chat | 43.9% | 33.8% |
	| Aligner 2B+Claude 3 Opus | 41.8% | 34.5% |
	| Claude 3 Opus (02/29) | 40.5% | 29.1% |
	| GPT-4 | 38.1% | 23.6% |
	| openchat-3.5-0106 | 15.4% | 10.1% |

	Please refer to the [leaderboard webpage](https://tatsu-lab.github.io/alpaca_eval/) for up-to-date results.

	We also conducted preliminary evaluations on other benchmarks and observed no significant degradation.

	| | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | Avg. |
	| ----------------- | ----- | --------- | ----- | ---------- | ---------- | ----- |
	| Storm-7B | 67.58 | 80.97 | 62.21 | 57.24 | 80.51 | 69.70 |
	| openchat-3.5-0106 | 66.38 | 83.00 | 63.47 | 52.55 | 81.06 | 69.29 |
	| internlm2-7b | 58.02 | 81.24 | 65.24 | 48.73 | 83.82 | 67.41 |
	| gemma-7B | 61.09 | 82.20 | 64.56 | 44.79 | 79.01 | 66.33 |
	| Yi-9B | 61.18 | 78.82 | 70.06 | 42.45 | 77.51 | 66.00 |
	| Meta-Llama-3-8B | 59.47 | 82.09 | 66.69 | 43.90 | 77.35 | 65.90 |
	| Mistral-7B-v0.1 | 59.98 | 83.31 | 64.16 | 42.15 | 78.37 | 65.59 |
	| Qwen-7b | 51.37 | 78.47 | 59.84 | 47.79 | 72.69 | 62.03 |
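
The Avg. column is consistent with an unweighted mean of the five benchmark scores. A quick check on two rows of the table (the dictionary and helper names are only illustrative):

```python
# Scores copied from the table: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande.
scores = {
    "Storm-7B": [67.58, 80.97, 62.21, 57.24, 80.51],
    "openchat-3.5-0106": [66.38, 83.00, 63.47, 52.55, 81.06],
}


def mean_score(values):
    """Unweighted mean of the benchmark scores."""
    return sum(values) / len(values)


for name, values in scores.items():
    print(f"{name}: {mean_score(values):.2f}")
```

The computed means match the reported averages of 69.70 and 69.29 to two decimal places.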

	## Uses

	Our model uses the same chat template as [OpenChat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106). A sample inference snippet is provided below.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	device = "cuda"

	model = AutoModelForCausalLM.from_pretrained("jieliu/Storm-7B").to(device)
	tokenizer = AutoTokenizer.from_pretrained("jieliu/Storm-7B")
	model.eval().requires_grad_(False)

	def generate_response(prompt):
	    # Tokenize the already-templated prompt and move it to the model's device.
	    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
	    outputs = model.generate(
	        input_ids,
	        max_length=2048,
	        do_sample=True,
	        temperature=1.0,
	        pad_token_id=tokenizer.pad_token_id,
	        eos_token_id=tokenizer.eos_token_id,
	    )
	    # outputs[0] holds the prompt tokens followed by the generated tokens.
	    response_ids = outputs[0]
	    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
	    return response_text

	prompt = "How does a telescope work?"
	input_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
	response_text = generate_response(input_prompt)
	print("Response:", response_text)
	```
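
For multi-turn conversations, the same template can be applied turn by turn. The helper below is a minimal sketch, assuming the OpenChat-3.5-0106 convention in which every turn ends with `<|end_of_turn|>` and the prompt closes with an open assistant turn. `build_prompt` and the message format are hypothetical names chosen for illustration, not part of the model's API.

```python
END_OF_TURN = "<|end_of_turn|>"
ROLE_PREFIX = {
    "user": "GPT4 Correct User: ",
    "assistant": "GPT4 Correct Assistant: ",
}


def build_prompt(messages):
    """Render a list of {"role", "content"} dicts into a single prompt string."""
    parts = [
        f"{ROLE_PREFIX[m['role']]}{m['content']}{END_OF_TURN}" for m in messages
    ]
    # Leave the final assistant turn open so the model generates the reply.
    parts.append("GPT4 Correct Assistant:")
    return "".join(parts)


single_turn = build_prompt([{"role": "user", "content": "How does a telescope work?"}])
print(single_turn)
```

For a single user message this yields the same single-turn layout used in the snippet above, and the resulting string can be passed to `generate_response`.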

	## Scripts
	You can reproduce our results on AlpacaEval 2.0 using the script provided below.
	```bash
	git clone https://github.com/tatsu-lab/alpaca_eval.git
	cd alpaca_eval
	pip install -e .
	export OPENAI_API_KEY=<your_api_key>
	alpaca_eval evaluate_from_model --model_configs 'Storm-7B'
	```

	## Limitations

	Storm-7B is a quick demonstration that a language model, fine-tuned with AI feedback, can easily surpass or match state-of-the-art models, as assessed by the same AI feedback. However, this improvement on the automatic leaderboard may not necessarily indicate better alignment with human intentions. Our model therefore represents a critical, preliminary reevaluation of the RLAIF paradigm, questioning how much learning from and being evaluated by AI feedback aligns with actual human preferences.

	## Citation

	```bibtex
	@misc{liu2024storm,
	    title = {Storm-7B: An Empirical Study of Iterative Direct Preference Optimization},
	    url = {},
	    author = {Jie Liu and Zhanhui Zhou and Chao Yang and Han-Sen Zhong and Wanli Ouyang},
	    month = {April},
	    year = {2024}
	}
	```