---
license: other
license_name: motif-license
license_link: LICENSE
language:
- en
---

# Introduction

We announce **Motif 2.6B**, a 2.6-billion-parameter language model trained from scratch on AMD Instinct™ MI250X GPUs. Motif 2.6B marks our very first step toward building helpful, reliable AI aligned with human values.

With this first release, we aim for Motif 2.6B to achieve performance comparable to well-known open-source models such as Phi, Llama, and Qwen, particularly those in the sLLM regime. A detailed technical report will be released at a later time; here, we present the initial evaluation results.

# Evaluation

When models are released, their accompanying technical reports or papers often present benchmark results based on evaluation settings chosen by the developers. While this is a common and understandable practice, it can make comparisons across organizations difficult: the same model may yield different scores depending on the evaluation conditions, and details of these conditions are not always fully disclosed. This lack of standardization can make it difficult for the open-source community to interpret and trust reported results.

We therefore reference performance scores based on the official numbers reported by each model's developers in their respective publications. To illustrate how much evaluation scores can vary across reports, we provide concrete examples of benchmark score differences for major models in the **Evaluation Appendix**. The "Improvement" columns below report the relative change of Motif 2.6B over each baseline score; an illustrative sketch of this computation appears at the end of this document.

### Comparison to Mistral

The benchmarks and metrics used are identical to those in the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).

|Benchmark|Metric|Mistral 7B|Motif 2.6B|Improvement|
|---|---|---|---|---|
|MMLU|5-shot|60.1|57.93|-3.61%|
|HellaSwag|0-shot|81.3|61.35|-24.54%|
|WinoG|0-shot|75.3|59.91|-20.44%|
|PIQA|0-shot|83|75.95|-8.49%|
|Arc-e|0-shot|80|87.21|+9.01%|
|Arc-c|0-shot|55.5|74.2|+33.69%|
|NQ|5-shot|28.8|11.14|-61.32%|
|TriviaQA|5-shot|69.9|54.97|-21.36%|
|HumanEval|0-shot|30.5|68.3|+123.93%|
|MBPP|3-shot|47.5|60.3|+26.95%|
|MATH|4-shot, maj@4|13.1|39.2*|+199.24%|
|GSM8K|8-shot, maj@8|52.2|77.71|+48.87%|
||||**Average**|**+33.01%**|

\*: We report the 4-shot score instead of the 4-shot, maj@4 score.

### Comparison to Llama

#### Llama 3

The benchmarks and metrics used are identical to those in the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).

|Benchmark|Metric|Llama 3 8B|Motif 2.6B|Improvement|
|---|---|---|---|---|
|MMLU|5-shot|69.4|57.93|-16.53%|
|MMLU|0-shot, CoT|73|55.9|-23.42%|
|MMLU-Pro|5-shot, CoT|48.3|-|-|
|IFEval|-|80.4|74.02|-7.94%|
|HumanEval|0-shot|72.6|68.3|-5.92%|
|MBPP|0-shot|72.8|57.93|-20.43%|
|GSM8K|8-shot, CoT|84.5|77.71|-8.04%|
|MATH|0-shot, CoT|51.9|49.68|-4.28%|
|ARC Challenge|0-shot|83.4|74.2|-11.03%|
|GPQA|0-shot, CoT|32.8|18.53|-43.51%|
||||**Average**|**-15.68%**|

#### Llama 3.2

The benchmarks and metrics used are identical to those in the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
|Benchmark|Metric|Llama 3.2 1B|Llama 3.2 3B|Motif 2.6B|Improvement (over 1B)|Improvement (over 3B)|
|---|---|---|---|---|---|---|
|MMLU|0-shot|49.3|63.4|57.6|+16.75%|-9.21%|
|Open-rewrite eval*|0-shot, rougeL|41.6|40.1|-|-|-|
|TLDR9+*|test, 1-shot, rougeL|16.8|19|-|-|-|
|IFEval|-|59.5|77.4|74.02|+24.40%|-4.37%|
|GSM8K|8-shot, CoT|44.4|77.7|74.9|+68.69%|-3.60%|
|MATH|0-shot, CoT|30.6|48|49.68|+62.35%|+3.50%|
|ARC Challenge|0-shot|59.4|78.5|74.2|+24.92%|-5.48%|
|GPQA|0-shot|27.2|32.8|25.45|-6.43%|-22.41%|
|Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
|||||**Average**|**+39.42%**|**-3.83%**|

\*: We were unable to find an evaluation framework for this benchmark.

### Comparison to Phi

The benchmarks and metrics used are identical to those in the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).

|Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement (over 3.8B)|Improvement (over 7B)|Improvement (over 2.7B)|
|---|---|---|---|---|---|---|---|---|
|MMLU|5-shot|68.8|75.7|56.3|57.93|-15.80%|-23.47%|+2.90%|
|HellaSwag|5-shot|76.7|77|53.6|68.97|-10.08%|-10.43%|+28.68%|
|ANLI|7-shot|52.8|58.1|42.5|47.99|-9.11%|-17.40%|+12.92%|
|GSM-8K|8-shot, CoT|82.5|89.6|61.1|76.5|-7.27%|-14.62%|+25.20%|
|MATH|0-shot, CoT|41.3|34.6|-|49.68|+20.29%|+43.58%|-|
|MedQA|2-shot|53.8|65.4|40.9|42.1|-21.75%|-35.63%|+2.93%|
|AGIEval*|0-shot|37.5|45.1|29.8|-|-|-|-|
|TriviaQA|5-shot|64|58.1|45.2|54.97|-14.11%|-5.39%|+21.62%|
|Arc-C|10-shot|84.9|90.7|75.9|75.17|-11.46%|-17.12%|-0.96%|
|Arc-E|10-shot|94.6|97|88.5|88.64|-6.30%|-8.62%|+0.16%|
|PIQA|5-shot|84.2|86.9|60.2|78.29|-7.02%|-9.91%|+30.05%|
|SociQA|5-shot|76.6|79.2|68.3|66.73|-12.89%|-15.74%|-2.30%|
|BigBench-Hard|3-shot, CoT|71.7|79.1|59.4|48.56|-32.27%|-38.61%|-18.25%|
|WinoGrande|5-shot|70.8|81.5|54.7|67.09|-5.24%|-17.68%|+22.65%|
|OpenBookQA|10-shot|83.2|88|73.6|87.8|+5.53%|-0.23%|+19.29%|
|BoolQ|2-shot|77.2|84.8|-|70.7|-8.42%|-16.63%|-|
|CommonSenseQA|10-shot|80.2|80|69.3|71.25|-11.16%|-10.94%|+2.81%|
|TruthfulQA|10-shot|65|70.2|-|52.07|-19.89%|-25.83%|-|
|HumanEval|0-shot|58.5|61|59|68.29|+16.74%|+11.95%|+15.75%|
|MBPP|3-shot|70|71.7|60.6|60.3|-13.86%|-15.90%|-0.50%|
|GPQA|2-shot, CoT|32.8|34.3|-|23.44|-28.54%|-31.66%|-|
|MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|

\*: We were unable to find an evaluation framework for this benchmark.

### Comparison to Gemma

#### Gemma 1 & 2

The benchmarks and metrics used are identical to those in the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).

#### Gemma 3

The benchmarks and metrics used are identical to those in the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).

|Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement (over 1B)|Improvement (over 4B)|
|---|---|---|---|---|---|---|
|MMLU-Pro|5-shot|14.7|43.6|-|-|-|
|LiveCodeBench*|-|1.9|12.6|-|-|-|
|Bird-SQL (dev)\*|-|6.4|36.3|-|-|-|
|GPQA Diamond|5-shot|19.2|30.8|31.81|+65.68%|+3.28%|
|SimpleQA*|-|2.2|4|-|-|-|
|FACTS Grounding*|-|36.4|70.1|-|-|-|
|MATH|4-shot|48|75.6|40.2|-16.25%|-46.83%|
|HiddenMath*|-|15.8|43|-|-|-|
|MMLU (val)|5-shot|-|48.8|57.93|-|+18.71%|
|||||**Average**|**+24.71%**|**-8.28%**|

\*: We were unable to find an evaluation framework for this benchmark.
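
As noted in the Evaluation section, each "Improvement" entry is the relative change of Motif 2.6B over the corresponding baseline score. The minimal sketch below recomputes the column for the Mistral 7B comparison table using the scores copied from that table; it is illustrative only and is not the evaluation harness used to produce the scores.

```python
# Minimal sketch: recompute the "Improvement" column of the Mistral 7B table.
# Each entry is the relative change of Motif 2.6B over the baseline, in percent.
# Scores are copied from the table above; this is illustrative only.

scores = {
    # benchmark: (Mistral 7B, Motif 2.6B)
    "MMLU":      (60.1, 57.93),
    "HellaSwag": (81.3, 61.35),
    "WinoG":     (75.3, 59.91),
    "PIQA":      (83.0, 75.95),
    "Arc-e":     (80.0, 87.21),
    "Arc-c":     (55.5, 74.20),
    "NQ":        (28.8, 11.14),
    "TriviaQA":  (69.9, 54.97),
    "HumanEval": (30.5, 68.30),
    "MBPP":      (47.5, 60.30),
    "MATH":      (13.1, 39.20),
    "GSM8K":     (52.2, 77.71),
}

def improvement(baseline: float, motif: float) -> float:
    """Relative change of the Motif 2.6B score over the baseline, in percent."""
    return (motif - baseline) / baseline * 100.0

for name, (baseline, motif) in scores.items():
    print(f"{name:<10} {improvement(baseline, motif):+.2f}%")
# e.g. MMLU -> -3.61%, Arc-c -> +33.69%, HumanEval -> +123.93%
```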