Last update: 11th June 2025

Introduction

We announce Motif 2.6B, a 2.6-billion-parameter language model trained from scratch on AMD Instinct™ MI250 GPUs. Motif 2.6B marks our first step toward building helpful, reliable AI aligned with human values. With this initial release, our goal is for Motif 2.6B to match the performance of well-known open-source models such as Gemma, Llama, and Phi, particularly those in the small language model (sLLM) regime.

Training information

  • GPUs: 384 MI250
  • Training time: 42 days
  • Training data: 2.4T tokens

Notice: A detailed technical report will be released at a later time.
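
For rough context, these figures imply the following approximate training throughput. The sketch below is only a back-of-the-envelope estimate derived from the three numbers above; it ignores restarts, evaluation pauses, and data-pipeline overhead.

# Back-of-the-envelope throughput estimate from the training figures above.
TOKENS = 2.4e12   # total training tokens
DAYS = 42         # wall-clock training time
GPUS = 384        # AMD Instinct MI250 GPUs

seconds = DAYS * 24 * 3600
aggregate_tps = TOKENS / seconds      # tokens/s across the whole cluster
per_gpu_tps = aggregate_tps / GPUS    # tokens/s per MI250

print(f"aggregate: {aggregate_tps:,.0f} tokens/s")  # ~661,000 tokens/s
print(f"per GPU:   {per_gpu_tps:,.0f} tokens/s")    # ~1,700 tokens/s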

Evaluation

When models are released, their accompanying technical reports or papers often present benchmark results based on evaluation settings chosen by the developers. While this is a common and understandable practice, it can lead to challenges when comparing models across different organizations. The same model may yield different scores depending on evaluation conditions, and details of these conditions are not always fully disclosed. This lack of standardization can make it difficult for the open-source community to interpret and trust reported results. We therefore reference performance scores based on the official numbers reported by each model’s developers in their respective publications.

To illustrate how much evaluation scores can vary across reports, we provide concrete examples of benchmark score differences for major models in the Evaluation Appendix.
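
Throughout the tables below, the "Improvement" columns are relative differences against the baseline model's reported score. A minimal sketch of that calculation:

def improvement(motif_score: float, baseline_score: float) -> float:
    """Relative difference of a Motif 2.6B score versus a baseline score, in percent."""
    return (motif_score - baseline_score) / baseline_score * 100

# Example: MMLU (5-shot) from the Mistral 7B comparison below.
print(f"{improvement(57.93, 60.1):+.2f}%")  # -3.61%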

Comparison to Mistral 7B by Mistral AI

The benchmarks and corresponding scores listed in the table below are taken directly from the Mistral 7B technical report.

| Benchmark | Metric | Mistral 7B | Motif 2.6B | Improvement |
|---|---|---|---|---|
| MMLU | 5-shot | 60.1 | 57.93 | -3.61% |
| HellaSwag | 0-shot | 81.3 | 61.35 | -24.54% |
| WinoG | 0-shot | 75.3 | 59.91 | -20.44% |
| PIQA | 0-shot | 83 | 75.95 | -8.49% |
| Arc-e | 0-shot | 80 | 87.21 | +9.01% |
| Arc-c | 0-shot | 55.5 | 74.2 | +33.69% |
| NQ | 5-shot | 28.8 | 11.14 | -61.32% |
| TriviaQA | 5-shot | 69.9 | 54.97 | -21.36% |
| HumanEval | 0-shot | 30.5 | 68.3 | +123.93% |
| MBPP | 3-shot | 47.5 | 60.3 | +26.95% |
| MATH | 4-shot, maj@4 | 13.1 | 40.2* | +206.87% |
| GSM8K | 8-shot, maj@8 | 52.2 | 80.21 | +53.66% |
| Average | | | | +34.25% |

* : We report the 4-shot, maj@1 score instead of the 4-shot, maj@4.

Comparison to the Gemma series by Google

Gemma 1 & 2

The benchmarks and corresponding scores listed in the table below are taken directly from the Gemma 2 technical report.

Note: Although referred to as "2B", Gemma 2 2B actually has 2.6 billion parameters.

| Benchmark | Metric | Gemma 1 2B | Gemma 1 7B | Gemma 2 2B | Gemma 2 9B | Motif 2.6B | Improvement (over Gemma 1 2B) | Improvement (over Gemma 1 7B) | Improvement (over Gemma 2 2B) | Improvement (over Gemma 2 9B) |
|---|---|---|---|---|---|---|---|---|---|---|
| MMLU | 5-shot | 42.3 | 64.4 | 52.2 | 71.3 | 57.93 | +36.95% | -10.05% | +10.98% | -18.75% |
| ARC-C | 25-shot | 48.5 | 61.1 | 55.7 | 68.4 | 75.08 | +54.80% | +22.88% | +34.79% | +9.77% |
| GSM8K | 5-shot | 15.1 | 51.8 | 24.3 | 68.6 | 75.13 | +397.55% | +45.04% | +309.18% | +9.52% |
| AGIEval | 3-5-shot | 24.2 | 44.9 | 31.5 | 52.8 | - | - | - | - | - |
| DROP | 3-shot, F1 | 48.5 | 56.3 | 51.2 | 69.4 | 29.33 | -39.53% | -47.90% | -42.71% | -57.74% |
| BBH | 3-shot, CoT | 35.2 | 59 | 41.9 | 68.2 | 48.56 | +37.95% | -17.69% | +15.89% | -28.80% |
| Winogrande | 5-shot | 66.8 | 79 | 71.3 | 80.6 | 67.09 | +0.43% | -15.08% | -5.90% | -16.76% |
| HellaSwag | 10-shot | 71.7 | 82.3 | 72.9 | 81.9 | 69.89 | -2.52% | -15.08% | -4.13% | -14.66% |
| MATH | 4-shot | 11.8 | 24.3 | 16 | 36.6 | 40.2 | +240.88% | +65.43% | +151.25% | +9.84% |
| ARC-e | 0-shot | 73.2 | 81.5 | 80.6 | 88 | 87.21 | +19.14% | +7.01% | +8.20% | -0.90% |
| PIQA | 0-shot | 77.3 | 81.2 | 78.4 | 81.7 | 75.95 | -1.75% | -6.47% | -3.13% | -7.04% |
| SIQA | 0-shot | 49.7 | 51.8 | 51.9 | 53.4 | 61.97 | +24.69% | +19.63% | +19.40% | +16.05% |
| Boolq | 0-shot | 69.4 | 83.2 | 72.7 | 84.2 | 67.76 | -2.36% | -18.56% | -6.80% | -19.52% |
| TriviaQA | 5-shot | 53.2 | 63.4 | 60.4 | 76.6 | 54.97 | +3.33% | -13.30% | -8.99% | -28.24% |
| NQ | 5-shot | 12.5 | 23 | 17.1 | 29.2 | 10.91 | -12.72% | -52.57% | -36.20% | -62.64% |
| HumanEval | pass@1 | 22 | 32.3 | 20.1 | 40.2 | 68.3 | +210.45% | +111.46% | +239.80% | +69.90% |
| MBPP | 3-shot | 29.2 | 44.4 | 30.2 | 52.4 | 60.3 | +106.51% | +35.81% | +99.67% | +15.08% |
| Average | | | | | | | +90.79% | +3.44% | +46.17% | -13.45% |

Gemma 3

The benchmarks and corresponding scores listed in the table below are taken directly from the Gemma 3 technical report.

| Benchmark | Metric | Gemma 3 1B | Gemma 3 4B | Motif 2.6B | Improvement (over 1B) | Improvement (over 4B) |
|---|---|---|---|---|---|---|
| HellaS | 10-shot | 62.3 | 77.2 | 69.89 | +12.18% | -9.47% |
| BoolQ | 0-shot | 63.2 | 72.3 | 67.76 | +7.22% | -6.28% |
| PIQA | 0-shot | 73.8 | 79.6 | 75.59 | +2.43% | -5.04% |
| SIQA | 0-shot | 48.9 | 51.9 | 61.97 | +26.73% | +19.40% |
| TQA | 5-shot | 39.8 | 65.8 | 54.97 | +38.12% | -16.46% |
| NQ | 5-shot | 9.48 | 20 | 10.91 | +15.08% | -45.45% |
| ARC-C | 25-shot | 38.4 | 56.2 | 75.08 | +95.52% | +33.59% |
| ARC-E | 0-shot | 73 | 82.4 | 87.21 | +19.47% | +5.84% |
| WinoG | 5-shot | 58.2 | 64.7 | 67.09 | +15.27% | +3.69% |
| BBH | few-shot, CoT | 28.4 | 50.9 | 48.56 | +70.99% | -4.60% |
| Drop | 1-shot, F1 | 42.4 | 60.1 | 29.33 | -30.83% | -51.20% |
| MMLU | 5-shot | - | 59.6 | 57.93 | - | -2.80% |
| MMLUpro | 5-shot, CoT | - | 29.2 | - | - | - |
| AGIE | 3-5-shot | - | 42.1 | - | - | - |
| MATH | 4-shot, CoT | - | 24.2 | 40.2 | - | +66.12% |
| GSM8K | 8-shot, CoT | - | 38.4 | 80.21 | - | +108.88% |
| GPQA Diamond | 5-shot, CoT | - | 15 | 31.81 | - | +112.07% |
| MBPP | 3-shot | - | 46 | 60.3 | - | +31.09% |
| HumanE | 0-shot | - | 36 | 68.3 | - | +89.72% |
| IFEval | - | 80.2 | 90.2 | 74.02 | -7.71% | -17.94% |
| Average | | | | | +22.04% | +17.29% |

Comparison to the Llama series by Meta

Llama 3

The benchmarks and corresponding scores listed in the table below are taken directly from the Llama 3 technical report.

| Benchmark | Metric | Llama 3 8B | Motif 2.6B | Improvement |
|---|---|---|---|---|
| MMLU | 5-shot | 69.4 | 57.93 | -16.53% |
| MMLU | 0-shot, CoT | 73 | 57.95 | -20.62% |
| MMLU-Pro | 5-shot, CoT | 48.3 | - | - |
| IFEval | - | 80.4 | 74.02 | -7.94% |
| HumanEval | 0-shot | 72.6 | 68.3 | -5.92% |
| MBPP | 0-shot | 72.8 | 57.93 | -20.43% |
| GSM8K | 8-shot, CoT | 84.5 | 80.21 | -5.08% |
| MATH | 0-shot, CoT | 51.9 | 49.68 | -4.28% |
| ARC Challenge | 0-shot | 83.4 | 74.2 | -11.03% |
| GPQA | 0-shot, CoT | 32.8 | 18.53 | -43.51% |
| Average | | | | -15.04% |

Llama 3.2

The benchmarks and corresponding scores listed in the table below are taken directly from the Llama 3.2 official blog.

| Benchmark | Metric | Llama 3.2 1B | Llama 3.2 3B | Motif 2.6B | Improvement (over 1B) | Improvement (over 3B) |
|---|---|---|---|---|---|---|
| MMLU | 0-shot | 49.3 | 63.4 | 57.6 | +16.75% | -9.21% |
| Open-rewrite eval* | 0-shot, rougeL | 41.6 | 40.1 | - | - | - |
| TLDR9+ | test, 1-shot, rougeL | 16.8 | 19 | - | - | - |
| IFEval | - | 59.5 | 77.4 | 74.02 | +24.40% | -4.37% |
| GSM8K | 8-shot, CoT | 44.4 | 77.7 | 80.21 | +80.65% | +3.23% |
| MATH | 0-shot, CoT | 30.6 | 48 | 49.68 | +62.35% | +3.50% |
| ARC Challenge | 0-shot | 59.4 | 78.6 | 74.2 | +24.92% | -5.60% |
| GPQA | 0-shot | 27.2 | 32.8 | 25.45 | -6.43% | -22.41% |
| Hellaswag | 0-shot | 41.2 | 69.8 | 61.35 | +48.91% | -12.11% |
| Average | | | | | +41.82% | -2.49% |

Comparison to the Phi series by Microsoft

The benchmarks and corresponding scores listed in the table below are taken directly from the Phi-3 technical report.

| Benchmark | Metric | Phi-3 3.8B | Phi-3 7B | Phi-2 2.7B | Motif 2.6B | Improvement (over 3.8B) | Improvement (over 7B) | Improvement (over 2.7B) |
|---|---|---|---|---|---|---|---|---|
| MMLU | 5-shot | 68.8 | 75.7 | 56.3 | 57.93 | -15.80% | -23.47% | +2.90% |
| HellaSwag | 5-shot | 76.7 | 77 | 53.6 | 68.97 | -10.08% | -10.43% | +28.68% |
| ANLI | 7-shot | 52.8 | 58.1 | 42.5 | 47.99 | -9.11% | -17.40% | +12.92% |
| GSM-8K | 8-shot, CoT | 82.5 | 89.6 | 61.1 | 80.21 | -2.78% | -10.48% | +31.28% |
| MATH | 0-shot, CoT | 41.3 | 34.6 | - | 49.68 | +20.29% | +43.58% | - |
| MedQA | 2-shot | 53.8 | 65.4 | 40.9 | 42.1 | -21.75% | -35.63% | +2.93% |
| AGIEval | 0-shot | 37.5 | 45.1 | 29.8 | - | - | - | - |
| TriviaQA | 5-shot | 64 | 58.1 | 45.2 | 54.97 | -14.11% | -5.39% | +21.62% |
| Arc-C | 10-shot | 84.9 | 90.7 | 75.9 | 75.17 | -11.46% | -17.12% | -0.96% |
| Arc-E | 10-shot | 94.6 | 97 | 88.5 | 88.64 | -6.30% | -8.62% | +0.16% |
| PIQA | 5-shot | 84.2 | 86.9 | 60.2 | 78.29 | -7.02% | -9.91% | +30.05% |
| SociQA | 5-shot | 76.6 | 79.2 | 68.3 | 66.73 | -12.89% | -15.74% | -2.30% |
| BigBench-Hard | 3-shot, CoT | 71.7 | 79.1 | 59.4 | 48.56 | -32.27% | -38.61% | -18.25% |
| WinoGrande | 5-shot | 70.8 | 81.5 | 54.7 | 67.09 | -5.24% | -17.68% | +22.65% |
| OpenBookQA | 10-shot | 83.2 | 88 | 73.6 | 87.8 | +5.53% | -0.23% | +19.29% |
| BoolQ | 2-shot | 77.2 | 84.8 | - | 70.7 | -8.42% | -16.63% | - |
| CommonSenseQA | 10-shot | 80.2 | 80 | 69.3 | 71.25 | -11.16% | -10.94% | +2.81% |
| TruthfulQA | 10-shot | 65 | 70.2 | - | 52.07 | -19.89% | -25.83% | - |
| HumanEval | 0-shot | 58.5 | 61 | 59 | 68.29 | +16.74% | +11.95% | +15.75% |
| MBPP | 3-shot | 70 | 71.7 | 60.6 | 60.3 | -13.86% | -15.90% | -0.50% |
| GPQA | 2-shot, CoT | 32.8 | 34.3 | - | 27.9 | -14.94% | -18.66% | - |
| MT Bench | 2R. Avg. | 8.38 | 8.7 | - | 6.77 | -19.21% | -22.18% | - |
| Average | | | | | | -9.87% | -13.25% | +10.56% |

Evaluation Appendix

In the comparisons presented above, Motif 2.6B showed average improvements of -15.04% and -13.45% over Llama 3 8B and Gemma 2 9B, respectively, based on the benchmark scores reported in their original technical reports. However, when compared against the benchmarks and scores reported in the Qwen2.5 technical report, Motif 2.6B shows an average improvement of +19.27% over Llama 3 8B and +1.68% over Gemma 2 9B. See the table below for details.
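
As a concrete example of this sensitivity, the same Motif 2.6B MMLU score yields noticeably different relative improvements depending on which reported Llama 3 8B baseline is used. The small sketch below only reuses numbers already present in the tables in this document.

# MMLU (5-shot): Motif 2.6B versus two different reported Llama 3 8B baselines.
motif_mmlu = 57.93
for source, baseline in [("Llama 3 report", 69.4), ("Qwen2.5 report", 66.6)]:
    rel = (motif_mmlu - baseline) / baseline * 100
    print(f"{source}: {rel:+.2f}%")  # -16.53% vs. -13.02%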

Comparison to Llama 3 8B and Gemma 2 9B based on scores from the Qwen2.5 technical report

The benchmarks and corresponding scores listed in the table below are taken directly from the Qwen2.5 technical report.

| Benchmark | Metric | Llama 3 8B | Gemma 2 9B | Motif 2.6B | Improvement (over Llama 3 8B) | Improvement (over Gemma 2 9B) |
|---|---|---|---|---|---|---|
| MMLU | 5-shot | 66.6 | 71.3 | 57.93 | -13.02% | -18.75% |
| MMLU-pro | 5-shot | 35.4 | 44.7 | 28.4 | -19.77% | -36.47% |
| MMLU-redux | 5-shot | 61.6 | 67.9 | 59.54 | -3.34% | -12.31% |
| BBH | 3-shot | 57.7 | 68.2 | 39.28 | -31.92% | -42.40% |
| ARC-C | 25-shot | 59.3 | 68.2 | 75.08 | +26.61% | +10.09% |
| TruthfulQA | 0-shot | 44 | 45.3 | 41.55 | -5.56% | -8.27% |
| Winogrande | 5-shot | 77.4 | 79.5 | 67.09 | -13.32% | -15.61% |
| HellaSwag | 10-shot | 82.1 | 81.9 | 69.88 | -14.88% | -14.68% |
| GPQA | 5-shot | 25.8 | 32.8 | 29.24 | +13.33% | -10.85% |
| TheoremQA | 5-shot | 22.1 | 28.9 | - | - | - |
| MATH | 4-shot | 20.5 | 37.7 | 40.2 | +96.10% | +6.63% |
| MMLU-stem | 5-shot | 55.3 | 65.1 | 52.9 | -4.34% | -18.74% |
| GSM8K | 4-shot | 55.3 | 70.7 | 75.2 | +35.99% | +6.36% |
| HumanEval | 0-shot | 33.5 | 37.8 | 68.3 | +103.88% | +80.69% |
| HumanEval+ | 0-shot | 29.3 | 30.5 | 62.2 | +112.29% | +103.93% |
| MBPP | 0-shot | 53.9 | 62.2 | 60.3 | +11.87% | -3.05% |
| MBPP+ | 0-shot | 44.4 | 50.6 | 50.8 | +14.41% | +0.40% |
| MultiPL-E | 0-shot | 22.6 | 34.9 | - | - | - |
| Average | | | | | +19.27% | +1.68% |

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model with its custom modeling code (trust_remote_code is required).
model = AutoModelForCausalLM.from_pretrained(
    "Motif-Technologies/Motif-2.6B",
    trust_remote_code=True,
    _attn_implementation="eager",  # also supports flash_attention_2
).cuda()

tokenizer = AutoTokenizer.from_pretrained(
    "Motif-Technologies/Motif-2.6B",
    trust_remote_code=True,
)

# Build a chat-formatted prompt and move it to the GPU.
query = "What is the capital city of South Korea?"
input_ids = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": query},
    ],
    add_generation_prompt=True,
    return_tensors="pt",
).cuda()

# Generate and decode only the newly produced tokens.
output = model.generate(input_ids, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)
output = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(output)

"""
The capital city of South Korea is Seoul. Located in the southern part of the country, Seoul is not only the largest city in South Korea but also one of the largest metropolitan areas in the world.
It is a vibrant and dynamic city known for its rich history, cultural heritage, and modern amenities. Seoul is a major economic, cultural, and political center in East Asia, and it plays a crucial role in the region's politics, economy, and culture.
The city is divided into different administrative districts, each with its own unique characteristics and attractions.
"""