JH-Motif committed
Commit 1c86ca5 · verified · 1 Parent(s): 1fe9873

Update README.md

Files changed (1)
  1. README.md +22 -20
README.md CHANGED
@@ -48,7 +48,7 @@ The benchmarks and metrics used are identical to those in the [Llama 3 technical
  |Benchmark|Metric|Llama 3 8B|Motif 2.6B|Improvement|
  |---|---|---|---|---|
  |MMLU|5-shot|69.4|57.93|-16.53%|
- |MMLU|0-shot, CoT|73|55.9|-23.42%|
+ |MMLU|0-shot, CoT|73|57.95|-20.62%|
  |MMLU-Pro|5-shot, CoT|48.3|-|-|
  |IFEval|-|80.4|74.02|-7.94%|
  |HumanEval|0-shot|72.6|68.3|-5.92%|
@@ -57,7 +57,7 @@ The benchmarks and metrics used are identical to those in the [Llama 3 technical
  |MATH|0-shot, CoT|51.9|49.68|-4.28%|
  |ARC Challenge|0-shot|83.4|74.2|-11.03%|
  |GPQA|0-shot, CoT|32.8|18.53|-43.51%|
- ||||**Average**|**-15.68%**|
+ ||||**Average**|**-15.36%**|

  #### Llama 3.2
  The benchmarks and metrics used are identical to those in the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
@@ -115,24 +115,26 @@ The benchmarks and metrics used are identical to those in the [Gemma 2 technical

  |Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
  |---|---|---|---|---|---|---|---|---|---|---|
- |MMLU|5-shot||||||||||
- |ARC-C|25-shot||||||||||
- |GSM8K|5-shot||||||||||
- |AGIEval*|3-5-shot||||||||||
- |DROP|3-shot, F1||||||||||
- |BBH|3-shot, CoT||||||||||
- |Winogrande|5-shot||||||||||
- |HellaSwag|10-shot||||||||||
- |MATH|4-shot||||||||||
- |ARC-e|0-shot||||||||||
- |PIQA|0-shot||||||||||
- |SIQA|0-shot||||||||||
- |Boolq|0-shot||||||||||
- |TriviaQA|5-shot||||||||||
- |NQ|5-shot||||||||||
- |HumanEval|pass@1||||||||||
- |MBPP|3-shot||||||||||
- |||||||**Average**|**TBA**|**TBA**|**TBA**|**TBA**|
+ |MMLU|5-shot|42.3|64.4|52.2|71.3|57.93|+36.95%|-10.05%|+10.98%|-18.75%|
+ |ARC-C|25-shot|48.5|61.1|55.7|68.4|75.08|+54.80%|+22.88%|+34.79%|+9.77%|
+ |GSM8K|5-shot|15.1|51.8|24.3|68.6|67.85|+349.34%|+30.98%|+179.22%|-1.09%|
+ |AGIEval*|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
+ |DROP|3-shot, F1|48.5|56.3|51.2|69.4|29.33|-39.53%|-47.90%|-42.71%|-57.74%|
+ |BBH|3-shot, CoT|35.2|59|41.9|68.2|48.56|+37.95%|-17.69%|+15.89%|-28.80%|
+ |Winogrande|5-shot|66.8|79|71.3|80.6|67.09|+0.43%|-15.08%|-5.90%|-16.76%|
+ |HellaSwag|10-shot|71.7|82.3|72.9|81.9|69.89|-2.52%|-15.08%|-4.13%|-14.66%|
+ |MATH|4-shot|11.8|24.3|16|36.6|40.2|+240.88%|+65.43%|+151.25%|+9.84%|
+ |ARC-e|0-shot|73.2|81.5|80.6|88|87.21|+19.14%|+7.01%|+8.20%|-0.90%|
+ |PIQA|0-shot|77.3|81.2|78.4|81.7|75.95|-1.75%|-6.47%|-3.13%|-7.04%|
+ |SIQA|0-shot|49.7|51.8|51.9|53.4|61.97|+24.69%|+19.63%|+19.40%|+16.05%|
+ |Boolq|0-shot|69.4|83.2|72.7|84.2|67.76|-2.36%|-18.56%|-6.80%|-19.52%|
+ |TriviaQA|5-shot|53.2|63.4|60.4|76.6|54.97|+3.33%|-13.30%|-8.99%|-28.24%|
+ |NQ|5-shot|12.5|23|17.1|29.2|10.91|-12.72%|-52.57%|-36.20%|-62.64%|
+ |HumanEval|pass@1|22|32.3|20.1|40.2|68.3|+210.45%|+111.46%|+239.80%|+69.90%|
+ |MBPP|3-shot|29.2|44.4|30.2|52.4|60.3|+106.51%|+35.81%|+99.67%|+15.08%|
+ |||||||**Average**|**+84.76%**|**+1.69%**|**+42.42%**|**-14.78%**|
+
+ \*: We were unable to find an evaluation framework for this benchmark.

  #### Gemma 3
  The benchmarks and metrics used are identical to those in the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
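For readers checking the updated numbers: the Improvement columns appear to be plain relative change of the Motif 2.6B score against each baseline, i.e. (Motif − baseline) / baseline × 100. A minimal sketch under that assumption (the `improvement` helper below is illustrative, not part of this repo), reproducing the corrected MMLU row and one Gemma row:

```python
# Assumption: "Improvement" = relative change of Motif 2.6B over a baseline, in percent.
def improvement(motif: float, baseline: float) -> float:
    """Relative change of Motif's score over a baseline score, in percent."""
    return (motif - baseline) / baseline * 100

# The row this commit corrects: MMLU (0-shot, CoT), Llama 3 8B = 73, Motif = 57.95
print(f"{improvement(57.95, 73):+.2f}%")    # -20.62% (the old score 55.9 gave -23.42%)

# Spot check from the Gemma 2 table: MMLU (5-shot), Gemma 1 2B = 42.3, Motif = 57.93
print(f"{improvement(57.93, 42.3):+.2f}%")  # +36.95%
```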