---
license: other
license_name: motif-license
license_link: LICENSE
language:
- en
---

# Introduction

We announce **Motif 2.6B**, a 2.6-billion-parameter language model trained from scratch on AMD Instinct™ MI250X GPUs. Motif 2.6B marks our first step toward building helpful, reliable AI aligned with human values. With this initial release, our goal is for Motif 2.6B to match the performance of well-known open-source models such as Phi, Llama, and Qwen, particularly those in the sLLM (small language model) regime.
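
A minimal inference sketch using Hugging Face `transformers` is shown below. The repository id in the snippet is a placeholder, so substitute the id of this repository, and `trust_remote_code` may or may not be required depending on the released configuration.

```python
# Minimal inference sketch. The repository id below is a placeholder; use the
# id of this repository, and check whether trust_remote_code is required.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Motif-Technologies/Motif-2.6B"  # placeholder repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # select bf16/fp16 automatically when supported
    device_map="auto",    # place weights on an accelerator if available
    trust_remote_code=True,
)

prompt = "Explain what a small language model is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```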

# Training information

- GPUs: 384 MI250X
- Training time: 42 days
- Training data: 2.4T tokens

*A detailed technical report will be released at a later time.*
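
As a rough sanity check, the figures reported above imply the aggregate training throughput estimated below; this is a back-of-the-envelope calculation that ignores utilization, parallelism overhead, and restarts.

```python
# Back-of-the-envelope throughput implied by the reported training figures.
tokens = 2.4e12  # 2.4T training tokens
gpus = 384       # AMD Instinct MI250X GPUs
days = 42        # wall-clock training time

seconds = days * 24 * 3600
tokens_per_second = tokens / seconds              # aggregate over all GPUs
tokens_per_gpu_second = tokens_per_second / gpus

print(f"aggregate: {tokens_per_second:,.0f} tokens/s")      # ~661,000 tokens/s
print(f"per GPU:   {tokens_per_gpu_second:,.0f} tokens/s")   # ~1,700 tokens/s
```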

# Evaluation

When models are released, their accompanying technical reports or papers often present benchmark results based on evaluation settings chosen by the developers. While this is a common and understandable practice, it complicates comparisons across organizations: the same model may yield different scores depending on the evaluation conditions, and those conditions are not always fully disclosed. This lack of standardization can make it difficult for the open-source community to interpret and trust reported results. We therefore reference the official numbers reported by each model's developers in their respective publications.

To illustrate how much evaluation scores can vary across reports, we provide concrete examples of benchmark score differences for major models in the **Evaluation Appendix**.
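
For reference, the "Improvement" columns in the tables below are consistent with the standard relative difference, (Motif score - baseline score) / baseline score, expressed in percent. The sketch below reproduces that per-benchmark calculation; the helper names are illustrative rather than taken from any released evaluation script, and the published **Average** rows may aggregate or select benchmarks differently.

```python
# Sketch of how the "Improvement" cells can be reproduced: relative difference
# of Motif's score vs. a baseline, in percent.
# (Helper names are illustrative, not from an official evaluation script.)

def improvement(motif: float, baseline: float) -> float:
    """Relative improvement of Motif over a baseline score, in percent."""
    return (motif - baseline) / baseline * 100.0

def average_improvement(pairs):
    """One plausible aggregation: mean over pairs where both scores exist."""
    deltas = [improvement(m, b) for m, b in pairs if m is not None and b is not None]
    return sum(deltas) / len(deltas)

# Example: MMLU 5-shot from the Mistral 7B comparison below.
print(f"{improvement(57.93, 60.1):+.2f}%")  # -> -3.61%
```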

### Comparison to Mistral 7B by Mistral AI

The benchmarks and corresponding scores listed in the table below are taken directly from the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).

|Benchmark|Metric|Mistral 7B|Motif 2.6B|Improvement|
|---|---|---|---|---|
|MMLU|5-shot|60.1|57.93|-3.61%|
|HellaSwag|0-shot|81.3|61.35|-24.54%|
|WinoG|0-shot|75.3|59.91|-20.44%|
|PIQA|0-shot|83|75.95|-8.49%|
|Arc-e|0-shot|80|87.21|+9.01%|
|Arc-c|0-shot|55.5|74.2|+33.69%|
|NQ|5-shot|28.8|11.14|-61.32%|
|TriviaQA|5-shot|69.9|54.97|-21.36%|
|HumanEval|0-shot|30.5|68.3|+123.93%|
|MBPP|3-shot|47.5|60.3|+26.95%|
|MATH|4-shot, maj@4|13.1|40.2*|+206.87%|
|GSM8K|8-shot, maj@8|52.2|77.71|+48.87%|
||||**Average**|**+33.77%**|

\* : We report the 4-shot score instead of the 4-shot, maj@4.
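
For clarity, maj@N (as in "4-shot, maj@4") scores the answer chosen by majority vote over N sampled completions per problem. The sketch below illustrates the idea; `sample_answer` is a hypothetical stand-in for one sampled completion reduced to a final answer and is not part of any released evaluation code.

```python
# Illustration of maj@N scoring: sample N answers per problem and grade the
# majority-voted answer. `sample_answer` is a hypothetical callable that maps
# a problem to one sampled final answer (not part of any released code).
from collections import Counter

def maj_at_n(problem: str, n: int, sample_answer) -> str:
    """Return the most common answer among n samples for one problem."""
    answers = [sample_answer(problem) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Usage sketch: maj@4 on a single MATH-style problem.
# predicted = maj_at_n("Compute 3 + 4 * 2.", n=4, sample_answer=my_model_fn)
# correct = (predicted == "11")
```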

### Comparison to the Gemma series by Google

#### Gemma 1 & 2
The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).

*Note: Although referred to as "2B", Gemma 2 2B actually has **2.6 billion** parameters.*

|Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement (over Gemma 1 2B)|Improvement (over Gemma 1 7B)|Improvement (over Gemma 2 2B)|Improvement (over Gemma 2 9B)|
|---|---|---|---|---|---|---|---|---|---|---|
|MMLU|5-shot|42.3|64.4|52.2|71.3|57.93|+36.95%|-10.05%|+10.98%|-18.75%|
|ARC-C|25-shot|48.5|61.1|55.7|68.4|75.08|+54.80%|+22.88%|+34.79%|+9.77%|
|GSM8K|5-shot|15.1|51.8|24.3|68.6|67.85|+349.34%|+30.98%|+179.22%|-1.09%|
|AGIEval*|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
|DROP|3-shot, F1|48.5|56.3|51.2|69.4|29.33|-39.53%|-47.90%|-42.71%|-57.74%|
|BBH|3-shot, CoT|35.2|59|41.9|68.2|48.56|+37.95%|-17.69%|+15.89%|-28.80%|
|Winogrande|5-shot|66.8|79|71.3|80.6|67.09|+0.43%|-15.08%|-5.90%|-16.76%|
|HellaSwag|10-shot|71.7|82.3|72.9|81.9|69.89|-2.52%|-15.08%|-4.13%|-14.66%|
|MATH|4-shot|11.8|24.3|16|36.6|40.2|+240.88%|+65.43%|+151.25%|+9.84%|
|ARC-e|0-shot|73.2|81.5|80.6|88|87.21|+19.14%|+7.01%|+8.20%|-0.90%|
|PIQA|0-shot|77.3|81.2|78.4|81.7|75.95|-1.75%|-6.47%|-3.13%|-7.04%|
|SIQA|0-shot|49.7|51.8|51.9|53.4|61.97|+24.69%|+19.63%|+19.40%|+16.05%|
|Boolq|0-shot|69.4|83.2|72.7|84.2|67.76|-2.36%|-18.56%|-6.80%|-19.52%|
|TriviaQA|5-shot|53.2|63.4|60.4|76.6|54.97|+3.33%|-13.30%|-8.99%|-28.24%|
|NQ|5-shot|12.5|23|17.1|29.2|10.91|-12.72%|-52.57%|-36.20%|-62.64%|
|HumanEval|pass@1|22|32.3|20.1|40.2|68.3|+210.45%|+111.46%|+239.80%|+69.90%|
|MBPP|3-shot|29.2|44.4|30.2|52.4|60.3|+106.51%|+35.81%|+99.67%|+15.08%|
|||||||**Average**|**+84.76%**|**+1.69%**|**+42.42%**|**-14.78%**|

\*: We were unable to find an evaluation framework for this benchmark.

#### Gemma 3
The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).

|Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement (over Gemma 3 1B)|Improvement (over Gemma 3 4B)|
|---|---|---|---|---|---|---|
|HellaS|10-shot|62.3|77.2|69.89|+12.18%|-9.47%|
|BoolQ|0-shot|63.2|72.3|67.76|+7.22%|-6.28%|
|PIQA|0-shot|73.8|79.6|75.59|+2.43%|-5.04%|
|SIQA|0-shot|48.9|51.9|61.97|+26.73%|+19.40%|
|TQA|5-shot|39.8|65.8|54.97|+38.12%|-16.46%|
|NQ|5-shot|9.48|20|10.91|+15.08%|-45.45%|
|ARC-C|25-shot|38.4|56.2|75.08|+95.52%|+33.59%|
|ARC-E|0-shot|73|82.4|87.21|+19.47%|+5.84%|
|WinoG|5-shot|58.2|64.7|67.09|+15.27%|+3.69%|
|BBH|few-shot, CoT|28.4|50.9|48.56|+70.99%|-4.60%|
|Drop|1-shot, F1|42.4|60.1|29.33|-30.83%|-51.20%|
|MMLU|5-shot|-|59.6|57.93|-|-2.80%|
|MMLUpro|5-shot, CoT|-|29.2|-|-|-|
|AGIE|3-5-shot|-|42.1|-|-|-|
|MATH|4-shot, CoT|-|24.2|40.2|-|+66.12%|
|GSM8K|8-shot, CoT|-|38.4|77.71|-|+102.37%|
|GPQA Diamond|5-shot, CoT|-|15|31.81|-|+112.07%|
|MBPP|3-shot|-|46|60.3|-|+31.09%|
|HumanE|0-shot|-|36|68.3|-|+89.72%|
|IFEval|-|80.2|90.2|74.02|-7.71%|-17.94%|
|||||**Average**|**+22.04%**|**+16.93%**|

### Comparison to the Llama series by Meta

#### Llama 3
The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).

|Benchmark|Metric|Llama 3 8B|Motif 2.6B|Improvement|
|---|---|---|---|---|
|MMLU|5-shot|69.4|57.93|-16.53%|
|MMLU|0-shot, CoT|73|57.95|-20.62%|
|MMLU-Pro|5-shot, CoT|48.3|-|-|
|IFEval|-|80.4|74.02|-7.94%|
|HumanEval|0-shot|72.6|68.3|-5.92%|
|MBPP|0-shot|72.8|57.93|-20.43%|
|GSM8K|8-shot, CoT|84.5|77.71|-8.04%|
|MATH|0-shot, CoT|51.9|49.68|-4.28%|
|ARC Challenge|0-shot|83.4|74.2|-11.03%|
|GPQA|0-shot, CoT|32.8|18.53|-43.51%|
||||**Average**|**-15.36%**|

#### Llama 3.2
The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).

|Benchmark|Metric|Llama 3.2 1B|Llama 3.2 3B|Motif 2.6B|Improvement (over Llama 3.2 1B)|Improvement (over Llama 3.2 3B)|
|---|---|---|---|---|---|---|
|MMLU|0-shot|49.3|63.4|57.6|+16.75%|-9.21%|
|Open-rewrite eval*|0-shot, rougeL|41.6|40.1|-|-|-|
|TLDR9+*|test, 1-shot, rougeL|16.8|19|-|-|-|
|IFEval|-|59.5|77.4|74.02|+24.40%|-4.37%|
|GSM8K|8-shot, CoT|44.4|77.7|74.9|+68.69%|-3.60%|
|MATH|0-shot, CoT|30.6|48|49.68|+62.35%|+3.50%|
|ARC Challenge|0-shot|59.4|78.6|74.2|+24.92%|-5.60%|
|GPQA|0-shot|27.2|32.8|25.45|-6.43%|-22.41%|
|Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
|||||**Average**|**+39.42%**|**-3.86%**|

\*: We were unable to find an evaluation framework for this benchmark.

### Comparison to the Phi series by Microsoft
The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).

|Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement (over Phi-3 3.8B)|Improvement (over Phi-3 7B)|Improvement (over Phi-2 2.7B)|
|---|---|---|---|---|---|---|---|---|
|MMLU|5-shot|68.8|75.7|56.3|57.93|-15.80%|-23.47%|+2.90%|
|HellaSwag|5-shot|76.7|77|53.6|68.97|-10.08%|-10.43%|+28.68%|
|ANLI|7-shot|52.8|58.1|42.5|47.99|-9.11%|-17.40%|+12.92%|
|GSM-8K|8-shot, CoT|82.5|89.6|61.1|76.5|-7.27%|-14.62%|+25.20%|
|MATH|0-shot, CoT|41.3|34.6|-|49.68|+20.29%|+43.58%|-|
|MedQA|2-shot|53.8|65.4|40.9|42.1|-21.75%|-35.63%|+2.93%|
|AGIEval*|0-shot|37.5|45.1|29.8|-|-|-|-|
|TriviaQA|5-shot|64|58.1|45.2|54.97|-14.11%|-5.39%|+21.62%|
|Arc-C|10-shot|84.9|90.7|75.9|75.17|-11.46%|-17.12%|-0.96%|
|Arc-E|10-shot|94.6|97|88.5|88.64|-6.30%|-8.62%|+0.16%|
|PIQA|5-shot|84.2|86.9|60.2|78.29|-7.02%|-9.91%|+30.05%|
|SociQA|5-shot|76.6|79.2|68.3|66.73|-12.89%|-15.74%|-2.30%|
|BigBench-Hard|3-shot, CoT|71.7|79.1|59.4|48.56|-32.27%|-38.61%|-18.25%|
|WinoGrande|5-shot|70.8|81.5|54.7|67.09|-5.24%|-17.68%|+22.65%|
|OpenBookQA|10-shot|83.2|88|73.6|87.8|+5.53%|-0.23%|+19.29%|
|BoolQ|2-shot|77.2|84.8|-|70.7|-8.42%|-16.63%|-|
|CommonSenseQA|10-shot|80.2|80|69.3|71.25|-11.16%|-10.94%|+2.81%|
|TruthfulQA|10-shot|65|70.2|-|52.07|-19.89%|-25.83%|-|
|HumanEval|0-shot|58.5|61|59|68.29|+16.74%|+11.95%|+15.75%|
|MBPP|3-shot|70|71.7|60.6|60.3|-13.86%|-15.90%|-0.50%|
|GPQA|2-shot, CoT|32.8|34.3|-|23.44|-28.54%|-31.66%|-|
|MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|

\*: We were unable to find an evaluation framework for this benchmark.


## Evaluation Appendix

In the comparisons presented above, Motif 2.6B showed average relative differences of -15.36% and -14.78% against Llama 3 8B and Gemma 2 9B, respectively, based on the benchmark scores reported in their original technical reports. However, when compared using the benchmarks and scores reported in the Qwen2.5 technical report, Motif 2.6B shows an average improvement of +18.55% over Llama 3 8B and +1.12% over Gemma 2 9B. See the table below for details.

### Comparison to Llama 3 8B and Gemma 2 9B based on scores from the *Qwen2.5 technical report*
The benchmarks and corresponding scores listed in the table below are taken directly from the [Qwen2.5 technical report](https://arxiv.org/abs/2412.15115).

|Benchmark|Metric|Llama 3 8B|Gemma 2 9B|Motif 2.6B|Improvement (over Llama 3 8B)|Improvement (over Gemma 2 9B)|
|---|---|---|---|---|---|---|
|MMLU|5-shot|66.6|71.3|57.93|-13.02%|-18.75%|
|MMLU-pro|5-shot|35.4|44.7|28.4|-19.77%|-36.47%|
|MMLU-redux|5-shot|61.6|67.9|59.54|-3.34%|-12.31%|
|BBH|3-shot|57.7|68.2|39.28|-31.92%|-42.40%|
|ARC-C|25-shot|59.3|68.2|75.08|+26.61%|+10.09%|
|TruthfulQA|0-shot|44|45.3|41.55|-5.56%|-8.27%|
|Winogrande|5-shot|77.4|79.5|67.09|-13.32%|-15.61%|
|HellaSwag|10-shot|82.1|81.9|69.88|-14.88%|-14.68%|
|GPQA|5-shot|25.8|32.8|29.24|+13.33%|-10.85%|
|TheoremQA|5-shot|22.1|28.9|-|-|-|
|MATH|4-shot|20.5|37.7|40.2|+96.10%|+6.63%|
|MMLU-stem|5-shot|55.3|65.1|52.9|-4.34%|-18.74%|
|GSM8K|4-shot|55.3|70.7|68.84|+24.48%|-2.63%|
|HumanEval|0-shot|33.5|37.8|68.3|+103.88%|+80.69%|
|HumanEval+|0-shot|29.3|30.5|62.2|+112.29%|+103.93%|
|MBPP|0-shot|53.9|62.2|60.3|+11.87%|-3.05%|
|MBPP+|0-shot|44.4|50.6|50.8|+14.41%|+0.40%|
|MultiPL-E|0-shot|22.6|34.9|-|-|-|
|||||**Average**|**+18.55%**|**+1.12%**|