Update README.md
README.md CHANGED
@@ -48,7 +48,7 @@ The benchmarks and metrics used are identical to those in the [Llama 3 technical
 |Benchmark|Metric|Llama 3 8B|Motif 2.6B|Improvement|
 |---|---|---|---|---|
 |MMLU|5-shot|69.4|57.93|-16.53%|
-|MMLU|0-shot, CoT|73|
+|MMLU|0-shot, CoT|73|57.95|-20.62%|
 |MMLU-Pro|5-shot, CoT|48.3|-|-|
 |IFEval|-|80.4|74.02|-7.94%|
 |HumanEval|0-shot|72.6|68.3|-5.92%|
@@ -57,7 +57,7 @@ The benchmarks and metrics used are identical to those in the [Llama 3 technical
 |MATH|0-shot, CoT|51.9|49.68|-4.28%|
 |ARC Challenge|0-shot|83.4|74.2|-11.03%|
 |GPQA|0-shot, CoT|32.8|18.53|-43.51%|
-||||**Average**|**-15.
+||||**Average**|**-15.36%**|

 #### Llama 3.2
 The benchmarks and metrics used are identical to those in the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
@@ -115,24 +115,26 @@ The benchmarks and metrics used are identical to those in the [Gemma 2 technical

 |Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
 |---|---|---|---|---|---|---|---|---|---|---|
-|MMLU|5-shot
-|ARC-C|25-shot
-|GSM8K|5-shot
-|AGIEval*|3-5-shot
-|DROP|3-shot, F1
-|BBH|3-shot, CoT
-|Winogrande|5-shot
-|HellaSwag|10-shot
-|MATH|4-shot
-|ARC-e|0-shot
-|PIQA|0-shot
-|SIQA|0-shot
-|Boolq|0-shot
-|TriviaQA|5-shot
-|NQ|5-shot
-|HumanEval|pass@1
-|MBPP|3-shot
-|||||||**Average
+|MMLU|5-shot|42.3|64.4|52.2|71.3|57.93|+36.95%|-10.05%|+10.98%|-18.75%|
+|ARC-C|25-shot|48.5|61.1|55.7|68.4|75.08|+54.80%|+22.88%|+34.79%|+9.77%|
+|GSM8K|5-shot|15.1|51.8|24.3|68.6|67.85|+349.34%|+30.98%|+179.22%|-1.09%|
+|AGIEval*|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
+|DROP|3-shot, F1|48.5|56.3|51.2|69.4|29.33|-39.53%|-47.90%|-42.71%|-57.74%|
+|BBH|3-shot, CoT|35.2|59|41.9|68.2|48.56|+37.95%|-17.69%|+15.89%|-28.80%|
+|Winogrande|5-shot|66.8|79|71.3|80.6|67.09|+0.43%|-15.08%|-5.90%|-16.76%|
+|HellaSwag|10-shot|71.7|82.3|72.9|81.9|69.89|-2.52%|-15.08%|-4.13%|-14.66%|
+|MATH|4-shot|11.8|24.3|16|36.6|40.2|+240.68%|+65.43%|+151.25%|+9.84%|
+|ARC-e|0-shot|73.2|81.5|80.6|88|87.21|+19.14%|+7.01%|+8.20%|-0.90%|
+|PIQA|0-shot|77.3|81.2|78.4|81.7|75.95|-1.75%|-6.47%|-3.13%|-7.04%|
+|SIQA|0-shot|49.7|51.8|51.9|53.4|61.97|+24.69%|+19.63%|+19.40%|+16.05%|
+|Boolq|0-shot|69.4|83.2|72.7|84.2|67.76|-2.36%|-18.56%|-6.80%|-19.52%|
+|TriviaQA|5-shot|53.2|63.4|60.4|76.6|54.97|+3.33%|-13.30%|-8.99%|-28.24%|
+|NQ|5-shot|12.5|23|17.1|29.2|10.91|-12.72%|-52.57%|-36.20%|-62.64%|
+|HumanEval|pass@1|22|32.3|20.1|40.2|68.3|+210.45%|+111.46%|+239.80%|+69.90%|
+|MBPP|3-shot|29.2|44.4|30.2|52.4|60.3|+106.51%|+35.81%|+99.67%|+15.08%|
+|||||||**Average**|**+84.76%**|**+1.69%**|**+42.42%**|**-14.78%**|
+
+\*: We were unable to find an evaluation framework for this benchmark.

 #### Gemma 3
 The benchmarks and metrics used are identical to those in the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
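Note: the `Improvement` figures in the tables above are consistent with a plain relative change of the Motif 2.6B score over each baseline score, and the `Average` rows with the mean of the per-benchmark improvements (skipping benchmarks with no Motif score, shown as `-`). A minimal sketch of that arithmetic, with hypothetical helper names that are not part of this repo:

```python
# Sketch (not from this repo) of how the Improvement columns appear
# to be computed: relative change of the Motif score over a baseline
# score, reported as a percentage.

def improvement(motif: float, baseline: float) -> float:
    """Relative change of `motif` over `baseline`, in percent."""
    return (motif / baseline - 1.0) * 100.0

def average_improvement(scores: list[tuple[float | None, float]]) -> float:
    """Mean per-benchmark improvement; benchmarks with no Motif score
    (shown as '-' in the tables) are skipped."""
    pcts = [improvement(m, b) for m, b in scores if m is not None]
    return sum(pcts) / len(pcts)

# Example: the MMLU 5-shot row of the Llama 3 8B table above.
assert round(improvement(57.93, 69.4), 2) == -16.53
```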