Update README.md
README.md CHANGED
@@ -48,7 +48,7 @@ The benchmarks and metrics used are identical to those in the [Llama 3 technical
 |Benchmark|Metric|Llama 3 8B|Motif 2.6B|Improvement|
 |---|---|---|---|---|
 |MMLU|5-shot|69.4|57.93|-16.53%|
-|MMLU|0-shot, CoT|73|
+|MMLU|0-shot, CoT|73|57.95|-20.62%|
 |MMLU-Pro|5-shot, CoT|48.3|-|-|
 |IFEval|-|80.4|74.02|-7.94%|
 |HumanEval|0-shot|72.6|68.3|-5.92%|
@@ -57,7 +57,7 @@ The benchmarks and metrics used are identical to those in the [Llama 3 technical
 |MATH|0-shot, CoT|51.9|49.68|-4.28%|
 |ARC Challenge|0-shot|83.4|74.2|-11.03%|
 |GPQA|0-shot, CoT|32.8|18.53|-43.51%|
-||||**Average**|**-15.
+||||**Average**|**-15.36%**|

 #### Llama 3.2
 The benchmarks and metrics used are identical to those in the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
@@ -115,24 +115,26 @@ The benchmarks and metrics used are identical to those in the [Gemma 2 technical

 |Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
 |---|---|---|---|---|---|---|---|---|---|---|
-|MMLU|5-shot
-|ARC-C|25-shot
-|GSM8K|5-shot
-|AGIEval*|3-5-shot
-|DROP|3-shot, F1
-|BBH|3-shot, CoT
-|Winogrande|5-shot
-|HellaSwag|10-shot
-|MATH|4-shot
-|ARC-e|0-shot
-|PIQA|0-shot
-|SIQA|0-shot
-|Boolq|0-shot
-|TriviaQA|5-shot
-|NQ|5-shot
-|HumanEval|pass@1
-|MBPP|3-shot
-|||||||**Average
+|MMLU|5-shot|42.3|64.4|52.2|71.3|57.93|+36.95%|-10.05%|+10.98%|-18.75%|
+|ARC-C|25-shot|48.5|61.1|55.7|68.4|75.08|+54.80%|+22.88%|+34.79%|+9.77%|
+|GSM8K|5-shot|15.1|51.8|24.3|68.6|67.85|+349.34%|+30.98%|+179.22%|-1.09%|
+|AGIEval*|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
+|DROP|3-shot, F1|48.5|56.3|51.2|69.4|29.33|-39.53%|-47.90%|-42.71%|-57.74%|
+|BBH|3-shot, CoT|35.2|59|41.9|68.2|48.56|+37.95%|-17.69%|+15.89%|-28.80%|
+|Winogrande|5-shot|66.8|79|71.3|80.6|67.09|+0.43%|-15.08%|-5.90%|-16.76%|
+|HellaSwag|10-shot|71.7|82.3|72.9|81.9|69.89|-2.52%|-15.08%|-4.13%|-14.66%|
+|MATH|4-shot|11.8|24.3|16|36.6|40.2|+240.68%|+65.43%|+151.25%|+9.84%|
+|ARC-e|0-shot|73.2|81.5|80.6|88|87.21|+19.14%|+7.01%|+8.20%|-0.90%|
+|PIQA|0-shot|77.3|81.2|78.4|81.7|75.95|-1.75%|-6.47%|-3.13%|-7.04%|
+|SIQA|0-shot|49.7|51.8|51.9|53.4|61.97|+24.69%|+19.63%|+19.40%|+16.05%|
+|Boolq|0-shot|69.4|83.2|72.7|84.2|67.76|-2.36%|-18.56%|-6.80%|-19.52%|
+|TriviaQA|5-shot|53.2|63.4|60.4|76.6|54.97|+3.33%|-13.30%|-8.99%|-28.24%|
+|NQ|5-shot|12.5|23|17.1|29.2|10.91|-12.72%|-52.57%|-36.20%|-62.64%|
+|HumanEval|pass@1|22|32.3|20.1|40.2|68.3|+210.45%|+111.46%|+239.80%|+69.90%|
+|MBPP|3-shot|29.2|44.4|30.2|52.4|60.3|+106.51%|+35.81%|+99.67%|+15.08%|
+|||||||**Average**|**+84.76%**|**+1.69%**|**+42.42%**|**-14.78%**|
+
+\*: We were unable to find an evaluation framework for this benchmark.

 #### Gemma 3
 The benchmarks and metrics used are identical to those in the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
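Note: the `Improvement` figures in the tables above are consistent with a plain relative change of the Motif 2.6B score over each baseline score, and the `Average` rows with the mean of the per-benchmark improvements (skipping benchmarks with no Motif score, shown as `-`). A minimal sketch of that arithmetic, with hypothetical helper names that are not part of this repo:

```python
# Sketch (not from this repo) of how the Improvement columns appear
# to be computed: relative change of the Motif score over a baseline
# score, reported as a percentage.

def improvement(motif: float, baseline: float) -> float:
    """Relative change of `motif` over `baseline`, in percent."""
    return (motif / baseline - 1.0) * 100.0

def average_improvement(scores: list[tuple[float | None, float]]) -> float:
    """Mean per-benchmark improvement; benchmarks with no Motif score
    (shown as '-' in the tables) are skipped."""
    pcts = [improvement(m, b) for m, b in scores if m is not None]
    return sum(pcts) / len(pcts)

# Example: the MMLU 5-shot row of the Llama 3 8B table above.
assert round(improvement(57.93, 69.4), 2) == -16.53
```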