Last update: 11th June 2025

Introduction

We announce Motif 2.6B, a 2.6-billion-parameter language model trained from scratch on AMD Instinct™ MI250 GPUs. Motif 2.6B marks our first step toward building helpful, reliable AI aligned with human values. With this initial release, our goal is for Motif 2.6B to match the performance of well-known open-source models such as Gemma, Llama, and Phi, particularly those in the small language model (sLLM) regime.

Training information

  • GPUs: 384 MI250
  • Training time: 42 days
  • Training data: 2.4T tokens

Notice: A detailed technical report will be released at a later time.
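
For rough context, these figures imply the following approximate training throughput. The sketch below is only a back-of-the-envelope estimate derived from the three numbers above; it ignores restarts, evaluation pauses, and data-pipeline overhead.

# Back-of-the-envelope throughput estimate from the training figures above.
TOKENS = 2.4e12   # total training tokens
DAYS = 42         # wall-clock training time
GPUS = 384        # AMD Instinct MI250 GPUs

seconds = DAYS * 24 * 3600
aggregate_tps = TOKENS / seconds      # tokens/s across the whole cluster
per_gpu_tps = aggregate_tps / GPUS    # tokens/s per MI250

print(f"aggregate: {aggregate_tps:,.0f} tokens/s")  # ~661,000 tokens/s
print(f"per GPU:   {per_gpu_tps:,.0f} tokens/s")    # ~1,700 tokens/s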

Evaluation

When models are released, their accompanying technical reports or papers often present benchmark results based on evaluation settings chosen by the developers. While this is a common and understandable practice, it can lead to challenges when comparing models across different organizations. The same model may yield different scores depending on evaluation conditions, and details of these conditions are not always fully disclosed. This lack of standardization can make it difficult for the open-source community to interpret and trust reported results. We therefore reference performance scores based on the official numbers reported by each model’s developers in their respective publications.

To illustrate how much evaluation scores can vary across reports, we provide concrete examples of benchmark score differences for major models in the Evaluation Appendix.
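
Throughout the tables below, the "Improvement" columns are relative differences against the baseline model's reported score. A minimal sketch of that calculation:

def improvement(motif_score: float, baseline_score: float) -> float:
    """Relative difference of a Motif 2.6B score versus a baseline score, in percent."""
    return (motif_score - baseline_score) / baseline_score * 100

# Example: MMLU (5-shot) from the Mistral 7B comparison below.
print(f"{improvement(57.93, 60.1):+.2f}%")  # -3.61%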

Comparison to Mistral 7B by Mistral AI

The benchmarks and corresponding scores listed in the table below are taken directly from the Mistral 7B technical report.

| Benchmark | Metric | Mistral 7B | Motif 2.6B | Improvement |
|---|---|---|---|---|
| MMLU | 5-shot | 60.1 | 57.93 | -3.61% |
| HellaSwag | 0-shot | 81.3 | 61.35 | -24.54% |
| WinoG | 0-shot | 75.3 | 59.91 | -20.44% |
| PIQA | 0-shot | 83 | 75.95 | -8.49% |
| Arc-e | 0-shot | 80 | 87.21 | +9.01% |
| Arc-c | 0-shot | 55.5 | 74.2 | +33.69% |
| NQ | 5-shot | 28.8 | 11.14 | -61.32% |
| TriviaQA | 5-shot | 69.9 | 54.97 | -21.36% |
| HumanEval | 0-shot | 30.5 | 68.3 | +123.93% |
| MBPP | 3-shot | 47.5 | 60.3 | +26.95% |
| MATH | 4-shot, maj@4 | 13.1 | 40.2* | +206.87% |
| GSM8K | 8-shot, maj@8 | 52.2 | 80.21 | +53.66% |
| Average | | | | +34.25% |

* : We report the 4-shot, maj@1 score instead of the 4-shot, maj@4.

Comparison to the Gemma series by Google

Gemma 1 & 2

The benchmarks and corresponding scores listed in the table below are taken directly from the Gemma 2 technical report.

Note: Although referred to as "2B", Gemma 2 2B actually has 2.6 billion parameters.

| Benchmark | Metric | Gemma 1 2B | Gemma 1 7B | Gemma 2 2B | Gemma 2 9B | Motif 2.6B | Improvement (over Gemma 1 2B) | Improvement (over Gemma 1 7B) | Improvement (over Gemma 2 2B) | Improvement (over Gemma 2 9B) |
|---|---|---|---|---|---|---|---|---|---|---|
| MMLU | 5-shot | 42.3 | 64.4 | 52.2 | 71.3 | 57.93 | +36.95% | -10.05% | +10.98% | -18.75% |
| ARC-C | 25-shot | 48.5 | 61.1 | 55.7 | 68.4 | 75.08 | +54.80% | +22.88% | +34.79% | +9.77% |
| GSM8K | 5-shot | 15.1 | 51.8 | 24.3 | 68.6 | 75.13 | +397.55% | +45.04% | +309.18% | +9.52% |
| AGIEval | 3-5-shot | 24.2 | 44.9 | 31.5 | 52.8 | - | - | - | - | - |
| DROP | 3-shot, F1 | 48.5 | 56.3 | 51.2 | 69.4 | 29.33 | -39.53% | -47.90% | -42.71% | -57.74% |
| BBH | 3-shot, CoT | 35.2 | 59 | 41.9 | 68.2 | 48.56 | +37.95% | -17.69% | +15.89% | -28.80% |
| Winogrande | 5-shot | 66.8 | 79 | 71.3 | 80.6 | 67.09 | +0.43% | -15.08% | -5.90% | -16.76% |
| HellaSwag | 10-shot | 71.7 | 82.3 | 72.9 | 81.9 | 69.89 | -2.52% | -15.08% | -4.13% | -14.66% |
| MATH | 4-shot | 11.8 | 24.3 | 16 | 36.6 | 40.2 | +240.88% | +65.43% | +151.25% | +9.84% |
| ARC-e | 0-shot | 73.2 | 81.5 | 80.6 | 88 | 87.21 | +19.14% | +7.01% | +8.20% | -0.90% |
| PIQA | 0-shot | 77.3 | 81.2 | 78.4 | 81.7 | 75.95 | -1.75% | -6.47% | -3.13% | -7.04% |
| SIQA | 0-shot | 49.7 | 51.8 | 51.9 | 53.4 | 61.97 | +24.69% | +19.63% | +19.40% | +16.05% |
| Boolq | 0-shot | 69.4 | 83.2 | 72.7 | 84.2 | 67.76 | -2.36% | -18.56% | -6.80% | -19.52% |
| TriviaQA | 5-shot | 53.2 | 63.4 | 60.4 | 76.6 | 54.97 | +3.33% | -13.30% | -8.99% | -28.24% |
| NQ | 5-shot | 12.5 | 23 | 17.1 | 29.2 | 10.91 | -12.72% | -52.57% | -36.20% | -62.64% |
| HumanEval | pass@1 | 22 | 32.3 | 20.1 | 40.2 | 68.3 | +210.45% | +111.46% | +239.80% | +69.90% |
| MBPP | 3-shot | 29.2 | 44.4 | 30.2 | 52.4 | 60.3 | +106.51% | +35.81% | +99.67% | +15.08% |
| Average | | | | | | | +90.79% | +3.44% | +46.17% | -13.45% |

Gemma 3

The benchmarks and corresponding scores listed in the table below are taken directly from the Gemma 3 technical report.

| Benchmark | Metric | Gemma 3 1B | Gemma 3 4B | Motif 2.6B | Improvement (over 1B) | Improvement (over 4B) |
|---|---|---|---|---|---|---|
| HellaS | 10-shot | 62.3 | 77.2 | 69.89 | +12.18% | -9.47% |
| BoolQ | 0-shot | 63.2 | 72.3 | 67.76 | +7.22% | -6.28% |
| PIQA | 0-shot | 73.8 | 79.6 | 75.59 | +2.43% | -5.04% |
| SIQA | 0-shot | 48.9 | 51.9 | 61.97 | +26.73% | +19.40% |
| TQA | 5-shot | 39.8 | 65.8 | 54.97 | +38.12% | -16.46% |
| NQ | 5-shot | 9.48 | 20 | 10.91 | +15.08% | -45.45% |
| ARC-C | 25-shot | 38.4 | 56.2 | 75.08 | +95.52% | +33.59% |
| ARC-E | 0-shot | 73 | 82.4 | 87.21 | +19.47% | +5.84% |
| WinoG | 5-shot | 58.2 | 64.7 | 67.09 | +15.27% | +3.69% |
| BBH | few-shot, CoT | 28.4 | 50.9 | 48.56 | +70.99% | -4.60% |
| Drop | 1-shot, F1 | 42.4 | 60.1 | 29.33 | -30.83% | -51.20% |
| MMLU | 5-shot | - | 59.6 | 57.93 | - | -2.80% |
| MMLUpro | 5-shot, CoT | - | 29.2 | - | - | - |
| AGIE | 3-5-shot | - | 42.1 | - | - | - |
| MATH | 4-shot, CoT | - | 24.2 | 40.2 | - | +66.12% |
| GSM8K | 8-shot, CoT | - | 38.4 | 80.21 | - | +108.88% |
| GPQA Diamond | 5-shot, CoT | - | 15 | 31.81 | - | +112.07% |
| MBPP | 3-shot | - | 46 | 60.3 | - | +31.09% |
| HumanE | 0-shot | - | 36 | 68.3 | - | +89.72% |
| IFEval | - | 80.2 | 90.2 | 74.02 | -7.71% | -17.94% |
| Average | | | | | +22.04% | +17.29% |

Comparison to the Llama series by Meta

Llama 3

The benchmarks and corresponding scores listed in the table below are taken directly from the Llama 3 technical report.

| Benchmark | Metric | Llama 3 8B | Motif 2.6B | Improvement |
|---|---|---|---|---|
| MMLU | 5-shot | 69.4 | 57.93 | -16.53% |
| MMLU | 0-shot, CoT | 73 | 57.95 | -20.62% |
| MMLU-Pro | 5-shot, CoT | 48.3 | - | - |
| IFEval | - | 80.4 | 74.02 | -7.94% |
| HumanEval | 0-shot | 72.6 | 68.3 | -5.92% |
| MBPP | 0-shot | 72.8 | 57.93 | -20.43% |
| GSM8K | 8-shot, CoT | 84.5 | 80.21 | -5.08% |
| MATH | 0-shot, CoT | 51.9 | 49.68 | -4.28% |
| ARC Challenge | 0-shot | 83.4 | 74.2 | -11.03% |
| GPQA | 0-shot, CoT | 32.8 | 18.53 | -43.51% |
| Average | | | | -15.04% |

Llama 3.2

The benchmarks and corresponding scores listed in the table below are taken directly from the Llama 3.2 official blog.

| Benchmark | Metric | Llama 3.2 1B | Llama 3.2 3B | Motif 2.6B | Improvement (over 1B) | Improvement (over 3B) |
|---|---|---|---|---|---|---|
| MMLU | 0-shot | 49.3 | 63.4 | 57.6 | +16.75% | -9.21% |
| Open-rewrite eval* | 0-shot, rougeL | 41.6 | 40.1 | - | - | - |
| TLDR9+ | test, 1-shot, rougeL | 16.8 | 19 | - | - | - |
| IFEval | - | 59.5 | 77.4 | 74.02 | +24.40% | -4.37% |
| GSM8K | 8-shot, CoT | 44.4 | 77.7 | 80.21 | +80.65% | +3.23% |
| MATH | 0-shot, CoT | 30.6 | 48 | 49.68 | +62.35% | +3.50% |
| ARC Challenge | 0-shot | 59.4 | 78.6 | 74.2 | +24.92% | -5.60% |
| GPQA | 0-shot | 27.2 | 32.8 | 25.45 | -6.43% | -22.41% |
| Hellaswag | 0-shot | 41.2 | 69.8 | 61.35 | +48.91% | -12.11% |
| Average | | | | | +41.82% | -2.49% |

Comparison to the Phi series by Microsoft

The benchmarks and corresponding scores listed in the table below are taken directly from the Phi-3 technical report.

| Benchmark | Metric | Phi-3 3.8B | Phi-3 7B | Phi-2 2.7B | Motif 2.6B | Improvement (over 3.8B) | Improvement (over 7B) | Improvement (over 2.7B) |
|---|---|---|---|---|---|---|---|---|
| MMLU | 5-shot | 68.8 | 75.7 | 56.3 | 57.93 | -15.80% | -23.47% | +2.90% |
| HellaSwag | 5-shot | 76.7 | 77 | 53.6 | 68.97 | -10.08% | -10.43% | +28.68% |
| ANLI | 7-shot | 52.8 | 58.1 | 42.5 | 47.99 | -9.11% | -17.40% | +12.92% |
| GSM-8K | 8-shot, CoT | 82.5 | 89.6 | 61.1 | 80.21 | -2.78% | -10.48% | +31.28% |
| MATH | 0-shot, CoT | 41.3 | 34.6 | - | 49.68 | +20.29% | +43.58% | - |
| MedQA | 2-shot | 53.8 | 65.4 | 40.9 | 42.1 | -21.75% | -35.63% | +2.93% |
| AGIEval | 0-shot | 37.5 | 45.1 | 29.8 | - | - | - | - |
| TriviaQA | 5-shot | 64 | 58.1 | 45.2 | 54.97 | -14.11% | -5.39% | +21.62% |
| Arc-C | 10-shot | 84.9 | 90.7 | 75.9 | 75.17 | -11.46% | -17.12% | -0.96% |
| Arc-E | 10-shot | 94.6 | 97 | 88.5 | 88.64 | -6.30% | -8.62% | +0.16% |
| PIQA | 5-shot | 84.2 | 86.9 | 60.2 | 78.29 | -7.02% | -9.91% | +30.05% |
| SociQA | 5-shot | 76.6 | 79.2 | 68.3 | 66.73 | -12.89% | -15.74% | -2.30% |
| BigBench-Hard | 3-shot, CoT | 71.7 | 79.1 | 59.4 | 48.56 | -32.27% | -38.61% | -18.25% |
| WinoGrande | 5-shot | 70.8 | 81.5 | 54.7 | 67.09 | -5.24% | -17.68% | +22.65% |
| OpenBookQA | 10-shot | 83.2 | 88 | 73.6 | 87.8 | +5.53% | -0.23% | +19.29% |
| BoolQ | 2-shot | 77.2 | 84.8 | - | 70.7 | -8.42% | -16.63% | - |
| CommonSenseQA | 10-shot | 80.2 | 80 | 69.3 | 71.25 | -11.16% | -10.94% | +2.81% |
| TruthfulQA | 10-shot | 65 | 70.2 | - | 52.07 | -19.89% | -25.83% | - |
| HumanEval | 0-shot | 58.5 | 61 | 59 | 68.29 | +16.74% | +11.95% | +15.75% |
| MBPP | 3-shot | 70 | 71.7 | 60.6 | 60.3 | -13.86% | -15.90% | -0.50% |
| GPQA | 2-shot, CoT | 32.8 | 34.3 | - | 27.9 | -14.94% | -18.66% | - |
| MT Bench | 2R. Avg. | 8.38 | 8.7 | - | 6.77 | -19.21% | -22.18% | - |
| Average | | | | | | -9.87% | -13.25% | +10.56% |

Evaluation Appendix

In the comparisons presented above, Motif 2.6B showed average improvements of -15.04% and -13.45% over Llama 3 8B and Gemma 2 9B, respectively, based on the benchmark scores reported in their original technical reports. However, when compared against the benchmarks and scores reported in the Qwen2.5 technical report, Motif 2.6B shows an average improvement of +19.27% over Llama 3 8B and +1.68% over Gemma 2 9B. See the table below for details.
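
As a concrete example of this sensitivity, the same Motif 2.6B MMLU score yields noticeably different relative improvements depending on which reported Llama 3 8B baseline is used. The small sketch below only reuses numbers already present in the tables in this document.

# MMLU (5-shot): Motif 2.6B versus two different reported Llama 3 8B baselines.
motif_mmlu = 57.93
for source, baseline in [("Llama 3 report", 69.4), ("Qwen2.5 report", 66.6)]:
    rel = (motif_mmlu - baseline) / baseline * 100
    print(f"{source}: {rel:+.2f}%")  # -16.53% vs. -13.02%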

Comparison to Llama 3 8B and Gemma 2 9B based on scores from the Qwen2.5 technical report

The benchmarks and corresponding scores listed in the table below are taken directly from the Qwen2.5 technical report.

| Benchmark | Metric | Llama 3 8B | Gemma 2 9B | Motif 2.6B | Improvement (over Llama 3 8B) | Improvement (over Gemma 2 9B) |
|---|---|---|---|---|---|---|
| MMLU | 5-shot | 66.6 | 71.3 | 57.93 | -13.02% | -18.75% |
| MMLU-pro | 5-shot | 35.4 | 44.7 | 28.4 | -19.77% | -36.47% |
| MMLU-redux | 5-shot | 61.6 | 67.9 | 59.54 | -3.34% | -12.31% |
| BBH | 3-shot | 57.7 | 68.2 | 39.28 | -31.92% | -42.40% |
| ARC-C | 25-shot | 59.3 | 68.2 | 75.08 | +26.61% | +10.09% |
| TruthfulQA | 0-shot | 44 | 45.3 | 41.55 | -5.56% | -8.27% |
| Winogrande | 5-shot | 77.4 | 79.5 | 67.09 | -13.32% | -15.61% |
| HellaSwag | 10-shot | 82.1 | 81.9 | 69.88 | -14.88% | -14.68% |
| GPQA | 5-shot | 25.8 | 32.8 | 29.24 | +13.33% | -10.85% |
| TheoremQA | 5-shot | 22.1 | 28.9 | - | - | - |
| MATH | 4-shot | 20.5 | 37.7 | 40.2 | +96.10% | +6.63% |
| MMLU-stem | 5-shot | 55.3 | 65.1 | 52.9 | -4.34% | -18.74% |
| GSM8K | 4-shot | 55.3 | 70.7 | 75.2 | +35.99% | +6.36% |
| HumanEval | 0-shot | 33.5 | 37.8 | 68.3 | +103.88% | +80.69% |
| HumanEval+ | 0-shot | 29.3 | 30.5 | 62.2 | +112.29% | +103.93% |
| MBPP | 0-shot | 53.9 | 62.2 | 60.3 | +11.87% | -3.05% |
| MBPP+ | 0-shot | 44.4 | 50.6 | 50.8 | +14.41% | +0.40% |
| MultiPL-E | 0-shot | 22.6 | 34.9 | - | - | - |
| Average | | | | | +19.27% | +1.68% |

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model with its custom modeling code (trust_remote_code is required).
model = AutoModelForCausalLM.from_pretrained(
    "Motif-Technologies/Motif-2.6B",
    trust_remote_code=True,
    _attn_implementation="eager",  # also supports flash_attention_2
).cuda()

tokenizer = AutoTokenizer.from_pretrained(
    "Motif-Technologies/Motif-2.6B",
    trust_remote_code=True,
)

# Build a chat-formatted prompt and move it to the GPU.
query = "What is the capital city of South Korea?"
input_ids = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": query},
    ],
    add_generation_prompt=True,
    return_tensors="pt",
).cuda()

# Generate and decode only the newly produced tokens.
output = model.generate(input_ids, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)
output = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(output)

"""
The capital city of South Korea is Seoul. Located in the southern part of the country, Seoul is not only the largest city in South Korea but also one of the largest metropolitan areas in the world.
It is a vibrant and dynamic city known for its rich history, cultural heritage, and modern amenities. Seoul is a major economic, cultural, and political center in East Asia, and it plays a crucial role in the region's politics, economy, and culture.
The city is divided into different administrative districts, each with its own unique characteristics and attractions.
"""