|
print("Response:", response_text)
```

## Scripts

You can reproduce our results on AlpacaEval 2.0 using the script provided below.

```bash
git clone https://github.com/tatsu-lab/alpaca_eval.git
cd alpaca_eval
pip install -e .
export OPENAI_API_KEY=<your_api_key>
alpaca_eval evaluate_from_model --model_configs 'Storm-7B'
```

## Limitations

Storm-7B is a quick demonstration that a language model fine-tuned with AI feedback can match or even surpass state-of-the-art models, as assessed by that same AI feedback. However, improvement on the automatic leaderboard does not necessarily indicate better alignment with human intentions. Our model therefore represents a critical, preliminary reevaluation of the RLAIF paradigm, questioning how well learning from, and being evaluated by, AI feedback aligns with actual human preferences.