---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
base_model:
- allenai/Llama-3.1-Tulu-3-8B-DPO
library_name: transformers
model-index:
- name: Llama-3.1-Tulu-3-8B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: wis-k/instruction-following-eval
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 82.55
      name: averaged accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=allenai%2FLlama-3.1-Tulu-3-8B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: SaylorTwift/bbh
      split: test
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 16.86
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=allenai%2FLlama-3.1-Tulu-3-8B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: lighteval/MATH-Hard
      split: test
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 18.88
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=allenai%2FLlama-3.1-Tulu-3-8B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 6.26
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=allenai%2FLlama-3.1-Tulu-3-8B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 10.52
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=allenai%2FLlama-3.1-Tulu-3-8B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 20.23
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=allenai%2FLlama-3.1-Tulu-3-8B
      name: Open LLM Leaderboard
---

<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/tulu3/Tulu3-logo.png" alt="Tulu 3 banner" width="800" style="margin-left:auto; margin-right:auto; display:block"/>

# Llama-3.1-Tulu-3-8B

Tülu3 is a leading instruction-following model family, offering fully open-source data, code, and recipes designed to serve as a comprehensive guide for modern post-training techniques.
Tülu3 is designed for state-of-the-art performance on a diverse set of tasks in addition to chat, such as MATH, GSM8K, and IFEval.

## Model description

- **Model type:** A model trained on a mix of publicly available, synthetic and human-created datasets.
- **Language(s) (NLP):** Primarily English
- **License:** Llama 3.1 Community License Agreement
- **Finetuned from model:** allenai/Llama-3.1-Tulu-3-8B-DPO

### Model Sources

- **Training Repository:** https://github.com/allenai/open-instruct
- **Eval Repository:** https://github.com/allenai/olmes
- **Paper:** https://arxiv.org/abs/2411.15124
- **Demo:** https://playground.allenai.org/

### Model Family

| **Stage**           | **Llama 3.1 8B**                                                                                          | **Llama 3.1 70B**                                                                                         |
|----------------------|----------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| **Base Model**       | [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)                                | [meta-llama/Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B)                              |
| **SFT**              | [allenai/Llama-3.1-Tulu-3-8B-SFT](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT)                | [allenai/Llama-3.1-Tulu-3-70B-SFT](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B-SFT)              |
| **DPO**              | [allenai/Llama-3.1-Tulu-3-8B-DPO](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-DPO)                | [allenai/Llama-3.1-Tulu-3-70B-DPO](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B-DPO)              |
| **Final Models (RLVR)**     | [allenai/Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B)                        | [allenai/Llama-3.1-Tulu-3-70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B)                      |
| **Reward Model (RM)**| [allenai/Llama-3.1-Tulu-3-8B-RM](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-RM)                                                     | (Same as 8B)                                                     |


| **Stage** | **Llama 3.1 405B** |
|-----------|-------------------|
| **Base Model** | [meta-llama/llama-3.1-405B](https://huggingface.co/meta-llama/llama-3.1-405B) |
| **SFT** | [allenai/llama-3.1-Tulu-3-405B-SFT](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B-SFT) |
| **Final Model (DPO)** | [allenai/llama-3.1-Tulu-3-405B](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B) |


## Using the model

### Loading with Hugging Face

To load the model with Hugging Face `transformers`, use the following snippet:
```
from transformers import AutoModelForCausalLM

tulu_model = AutoModelForCausalLM.from_pretrained("allenai/Llama-3.1-Tulu-3-8B")
```
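As a minimal sketch of going from loading to generation, the function below formats a single user turn with the tokenizer's built-in chat template and decodes only the newly generated tokens. The function name `generate_reply` and the `max_new_tokens` default are illustrative assumptions, not settings recommended by the authors.

```python
MODEL_ID = "allenai/Llama-3.1-Tulu-3-8B"


def generate_reply(prompt: str, max_new_tokens: int = 256) -> str:
    """Generate one assistant reply for a single user prompt (illustrative helper)."""
    # Imported lazily so the module can be imported without loading heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Let the tokenizer apply the chat template and append the assistant header.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    )

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the model's reply is decoded.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

Calling `generate_reply("How are you doing?")` downloads the weights on first use and runs on CPU unless you move the model and inputs to another device yourself.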

### vLLM

Because the model is based on Llama, it can be easily served with vLLM:
```
vllm serve allenai/Llama-3.1-Tulu-3-8B
```
Note that given the long chat template of Llama, you may want to use `--max_model_len=8192`.
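
Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server is reachable at vLLM's default address, `http://localhost:8000`; the helper names `build_chat_request` and `send` and the `max_tokens` default are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",  # assumed default address
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a chat-completions request for a locally served model."""
    payload = {
        "model": "allenai/Llama-3.1-Tulu-3-8B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `send(build_chat_request("How are you doing?"))` returns the assistant's reply as a string.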

### Chat template

The chat template for our models is formatted as:
```
<|user|>\nHow are you doing?\n<|assistant|>\nI'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
```
Or with new lines expanded:
```
<|user|>
How are you doing?
<|assistant|>
I'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
```
It is embedded within the tokenizer as well, for `tokenizer.apply_chat_template`.
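
To make the format concrete, here is a small pure-Python sketch that reproduces the example above by hand. It is not the tokenizer's actual template: the newline placed between an assistant turn and a following turn, and the handling of any role other than `user` and `assistant`, are assumptions, so prefer `tokenizer.apply_chat_template` in real use.

```python
from typing import Dict, List

EOS_TOKEN = "<|endoftext|>"


def render_chat(messages: List[Dict[str, str]],
                add_generation_prompt: bool = False) -> str:
    """Render messages in the format shown above (illustrative re-implementation)."""
    parts = []
    for index, message in enumerate(messages):
        is_last = index == len(messages) - 1
        if message["role"] == "assistant":
            # Assistant turns are closed with the end-of-text token.
            text = f"<|assistant|>\n{message['content']}{EOS_TOKEN}"
            if not is_last:
                text += "\n"  # assumption: turns after an assistant reply start on a new line
        else:
            # Assumption: every other role uses the same "<|role|>\ncontent\n" shape.
            text = f"<|{message['role']}|>\n{message['content']}\n"
        parts.append(text)
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|assistant|>\n")
    return "".join(parts)
```

Passing only the user turn with `add_generation_prompt=True` yields the prompt prefix that the model is expected to continue.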

### System prompt

In Ai2 demos, we use this system prompt by default:
```
You are Tulu 3, a helpful and harmless AI Assistant built by the Allen Institute for AI.
```
The model has not been trained with a specific system prompt in mind.
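
If you want to mirror the demo behaviour, one option is to prepend the prompt as a system message before applying the chat template. The sketch below assumes the conventional `system` role name in the message list; the helper `with_system_prompt` is hypothetical.

```python
from typing import Dict, List

DEFAULT_SYSTEM_PROMPT = (
    "You are Tulu 3, a helpful and harmless AI Assistant "
    "built by the Allen Institute for AI."
)


def with_system_prompt(messages: List[Dict[str, str]],
                       system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> List[Dict[str, str]]:
    """Return a copy of messages that starts with a system message."""
    if messages and messages[0]["role"] == "system":
        # Respect a system prompt the caller has already supplied.
        return list(messages)
    return [{"role": "system", "content": system_prompt}] + list(messages)
```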

### Bias, Risks, and Limitations

The Tülu3 models have limited safety training and, unlike ChatGPT, are not automatically deployed with in-the-loop filtering of responses, so they can produce problematic outputs (especially when prompted to do so).
The size and composition of the corpus used to train the base Llama 3.1 models are also unknown; however, it likely included a mix of web data and technical sources such as books and code.
See the Falcon 180B model card for an example of this.


## Performance

| Benchmark (eval)                | Tülu 3 SFT 8B | Tülu 3 DPO 8B | Tülu 3 8B | Llama 3.1 8B Instruct | Qwen 2.5 7B Instruct | Magpie 8B | Gemma 2 9B Instruct | Ministral 8B Instruct |
|---------------------------------|----------------|----------------|------------|------------------------|----------------------|-----------|---------------------|-----------------------|
| **Avg.**                        | 60.4           | 64.4           | **64.8**   | 62.2                  | 57.8                | 44.7      | 55.2               | 58.3                 |
| **MMLU (0 shot, CoT)**          | 65.9           | 68.7           | 68.2       | 71.2                  | **76.6**            | 62.0      | 74.6               | 68.5                 |
| **PopQA (15 shot)**             | **29.3**       | 29.3           | 29.1       | 20.2                  | 18.1                | 22.5      | 28.3               | 20.2                 |
| **TruthfulQA (6 shot)**         | 46.8           | 56.1           | 55.0       | 55.1                  | **63.1**            | 57.0      | 61.4               | 55.5                 |
| **BigBenchHard (3 shot, CoT)**  | **67.9**       | 65.8           | 66.0       | 62.8                  | 21.7                | 0.9       | 2.5                | 56.2                 |
| **DROP (3 shot)**               | 61.3           | 62.5           | **62.6**   | 61.5                  | 54.4                | 49.4      | 58.8               | 56.2                 |
| **MATH (4 shot CoT, Flex)**     | 31.5           | 42.0           | **43.7**   | 42.5                  | 14.8                | 5.1       | 29.8               | 40.0                 |
| **GSM8K (8 shot, CoT)**         | 76.2           | 84.3           | **87.6**   | 83.4                  | 83.8                | 61.2      | 79.7               | 80.0                 |
| **HumanEval (pass@10)**         | 86.2           | 83.9           | 83.9       | 86.3                  | **93.1**            | 75.4      | 71.7               | 91.0                 |
| **HumanEval+ (pass@10)**        | 81.4           | 78.6           | 79.2       | 82.9                  | **89.7**            | 69.1      | 67.0               | 88.5                 |
| **IFEval (prompt loose)**       | 72.8           | 81.1           | **82.4**   | 80.6                  | 74.7                | 38.8      | 69.9               | 56.4                 |
| **AlpacaEval 2 (LC % win)**     | 12.4           | 33.5           | 34.5       | 24.2                  | 29.0                | **49.0**  | 43.7               | 31.4                 |
| **Safety (6 task avg.)**        | **93.1**       | 87.2           | 85.5       | 75.2                  | 75.0                | 46.4      | 75.5               | 56.2                 |

| Benchmark (eval)                | Tülu 3 SFT 70B | Tülu 3 DPO 70B | Tülu 3 70B | Llama 3.1 70B Instruct | Qwen 2.5 72B Instruct | Hermes 3 Llama 3.1 70B | Nemotron Llama 3.1 70B |
|---------------------------------|-----------------|-----------------|-------------|-------------------------|-----------------------|------------------------|-------------------------|
| **Avg.**                        | 72.6            | 75.9            | **76.0**    | 73.4                   | 71.5                  | 68.3                   | 65.5                   |
| **MMLU (0 shot, CoT)**          | 78.9            | 83.3            | 83.1        | 85.3                   | **85.5**             | 80.4                   | 83.8                   |
| **PopQA (15 shot)**             | **48.6**        | 46.3            | 46.5        | 46.4                   | 30.6                  | 48.1                   | 36.4                   |
| **TruthfulQA (6 shot)**         | 55.7            | 67.9            | 67.6        | 66.8                   | **69.9**             | 66.5                   | 62.6                   |
| **BigBenchHard (3 shot, CoT)**  | **82.7**        | 81.8            | 82.0        | 73.8                   | 67.2                  | 82.1                   | 0.7                    |
| **DROP (3 shot)**               | **77.2**        | 74.1            | 74.3        | 77.0                   | 34.2                  | 73.2                   | 68.8                   |
| **MATH (4 shot CoT, Flex)**     | 53.7            | 62.3            | 63.0        | 56.4                   | **74.3**             | 41.9                   | 55.0                   |
| **GSM8K (8 shot, CoT)**         | 91.1            | 93.5            | 93.5        | **93.7**              | 89.5                  | 90.0                   | 84.7                   |
| **HumanEval (pass@10)**         | 92.9            | 92.4            | 92.4        | 93.6                   | 94.0                  | 89.6                   | **94.1**              |
| **HumanEval+ (pass@10)**        | 87.3            | 88.4            | 88.0        | 89.5                   | **90.8**             | 85.9                   | 85.5                   |
| **IFEval (prompt loose)**       | 82.1            | 82.6            | 83.2        | **88.0**              | 87.6                  | 76.0                   | 79.9                   |
| **AlpacaEval 2 (LC % win)**     | 26.3            | 49.6            | 49.8        | 33.4                   | 47.7                  | 28.4                   | **66.1**              |
| **Safety (6 task avg.)**        | **94.4**        | 89.0            | 88.3        | 76.5                   | 87.0                  | 57.9                   | 69.0                   |


## Hyperparameters

PPO settings for RLVR:
- **Learning Rate**: 3 × 10⁻⁷
- **Discount Factor (gamma)**: 1.0
- **Generalized Advantage Estimation (lambda)**: 0.95
- **Mini-batches (N_mb)**: 1
- **PPO Update Iterations (K)**: 4
- **PPO's Clipping Coefficient (epsilon)**: 0.2
- **Value Function Coefficient (c1)**: 0.1
- **Gradient Norm Threshold**: 1.0
- **Learning Rate Schedule**: Linear
- **Generation Temperature**: 1.0
- **Batch Size (effective)**: 512
- **Max Token Length**: 2,048
- **Max Prompt Token Length**: 2,048
- **Penalty Reward Value for Responses without an EOS Token**: -10.0
- **Response Length**: 1,024 (but 2,048 for MATH)
- **Total Episodes**: 100,000
- **KL penalty coefficient (beta)**: [0.1, 0.05, 0.03, 0.01]
- **Warm up ratio (omega)**: 0.0
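
To illustrate how three of the settings above enter a PPO update, the sketch below computes Generalized Advantage Estimation with gamma = 1.0 and lambda = 0.95, and the clipped surrogate objective with epsilon = 0.2. It is a textbook formulation in plain Python, not the open-instruct training code; the function names and the assumption that the value after the final step is 0 are illustrative.

```python
from typing import List

GAMMA = 1.0      # discount factor, from the list above
LAMBDA = 0.95    # GAE lambda, from the list above
CLIP_EPS = 0.2   # PPO clipping coefficient, from the list above


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = GAMMA, lam: float = LAMBDA) -> List[float]:
    """Generalized Advantage Estimation over one finished episode.

    values[t] is the value estimate at step t; the value after the last
    step is taken to be 0 because the episode has ended (an assumption).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(ratio: float, advantage: float,
                      eps: float = CLIP_EPS) -> float:
    """PPO clipped objective for one sample (a quantity to maximize)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a single terminal reward, as is typical for a verifiable reward given at the end of a response, the advantage decays by a factor of gamma × lambda per step when moving back from the final token.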

## License and use

All Llama 3.1 Tülu3 models are released under Meta's [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/).
Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc.
Tülu3 is intended for research and educational use.
For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).

The models have been fine-tuned using a dataset mix with outputs generated from third-party models and are subject to additional terms:
[Gemma Terms of Use](https://ai.google.dev/gemma/terms) and [Qwen License Agreement](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE) (models were improved using Qwen 2.5).


## Citation

If Tülu3 or any of the related materials were helpful to your work, please cite:
```
@article{lambert2024tulu3,
  title = {Tülu 3: Pushing Frontiers in Open Language Model Post-Training},
  author = {
    Nathan Lambert and 
    Jacob Morrison and 
    Valentina Pyatkin and 
    Shengyi Huang and 
    Hamish Ivison and 
    Faeze Brahman and 
    Lester James V. Miranda and 
    Alisa Liu and 
    Nouha Dziri and 
    Shane Lyu and 
    Yuling Gu and 
    Saumya Malik and 
    Victoria Graf and 
    Jena D. Hwang and 
    Jiangjiang Yang and
    Ronan Le Bras and
    Oyvind Tafjord and
    Chris Wilhelm and
    Luca Soldaini and 
    Noah A. Smith and 
    Yizhong Wang and 
    Pradeep Dasigi and 
    Hannaneh Hajishirzi
  },
  year = {2024},
  email = {[email protected]}
}
```
## [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/allenai__Llama-3.1-Tulu-3-8B-details)!
Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=allenai%2FLlama-3.1-Tulu-3-8B&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc)!

|      Metric       |Value (%)|
|-------------------|--------:|
|**Average**        |    25.88|
|IFEval (0-Shot)    |    82.55|
|BBH (3-Shot)       |    16.86|
|MATH Lvl 5 (4-Shot)|    18.88|
|GPQA (0-shot)      |     6.26|
|MuSR (0-shot)      |    10.52|
|MMLU-PRO (5-shot)  |    20.23|
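
The **Average** row is consistent with an unweighted mean of the six benchmark scores, which can be checked with a few lines of Python. The scores are copied from the table above; the variable names are illustrative.

```python
# Scores copied from the leaderboard table above (in percent).
scores = {
    "IFEval (0-Shot)": 82.55,
    "BBH (3-Shot)": 16.86,
    "MATH Lvl 5 (4-Shot)": 18.88,
    "GPQA (0-shot)": 6.26,
    "MuSR (0-shot)": 10.52,
    "MMLU-PRO (5-shot)": 20.23,
}

# Unweighted mean across the six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # prints 25.88
```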