MotifTech committed
Commit 7b041df · verified · 1 Parent(s): bd800ed

Update README.md

Files changed (1):
  1. README.md (+3, −10)
README.md CHANGED
@@ -61,7 +61,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |MMLU|5-shot|42.3|64.4|52.2|71.3|57.93|+36.95%|-10.05%|+10.98%|-18.75%|
  |ARC-C|25-shot|48.5|61.1|55.7|68.4|75.08|+54.80%|+22.88%|+34.79%|+9.77%|
  |GSM8K|5-shot|15.1|51.8|24.3|68.6|67.85|+349.34%|+30.98%|+179.22%|-1.09%|
- |AGIEval*|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
+ |AGIEval|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
  |DROP|3-shot, F1|48.5|56.3|51.2|69.4|29.33|-39.53%|-47.90%|-42.71%|-57.74%|
  |BBH|3-shot, CoT|35.2|59|41.9|68.2|48.56|37.95%|-17.69%|+15.89%|-28.80%|
  |Winogrande|5-shot|66.8|79|71.3|80.6|67.09|+0.43%|-15.08%|-5.90%|-16.76%|
@@ -77,8 +77,6 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |MBPP|3-shot|29.2|44.4|30.2|52.4|60.3|+106.51%|+35.81%|+99.67%|+15.08%|
  |||||||**Average**|**+84.76%**|**+1.69%**|**+42.42%**|**-14.78%**|

- \*: We were unable to find an evaluation framework for this benchmark.
-
  #### Gemma 3
  The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).

@@ -132,7 +130,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |---|---|---|---|---|---|---|
  |MMLU|0-shot|49.3|63.4|57.6|+16.75%|-9.21%|
  |Open-rewrite eval*|0-shot, rougeL|41.6|40.1|-|-|-|
- |TLDR9+*|test, 1-shot, rougeL|16.8|19|-|-|-|
+ |TLDR9+|test, 1-shot, rougeL|16.8|19|-|-|-|
  |IFEval|-|59.5|77.4|74.02|+24.40%|-4.37%|
  |GSM9K|8-shot, CoT|44.4|77.7|74.9|+68.69%|-3.60%|
  |MATH|0-shot, CoT|30.6|48|49.68|+62.35%|+3.50%|
@@ -141,8 +139,6 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
  |||||**Average**|**+39.42%**|**-3.86%**|

- \*: We were unable to find an evaluation framework for this benchmark.
-
  ### Comparison to the Phi series by Microsoft
  The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).

@@ -154,7 +150,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |GSM-8K|8-shot, CoT|82.5|89.6|61.1|76.5|-7.27%|-14.62%|+25.20%|
  |MATH|0-shot, CoT|41.3|34.6|-|49.68|+20.29%|+43.58%|-|
  |MedQA|2-shot|53.8|65.4|40.9|42.1|-21.75%|-35.63%|+2.93%|
- |AGIEval*|0-shot|37.5|45.1|29.8|-|-|-|-|
+ |AGIEval|0-shot|37.5|45.1|29.8|-|-|-|-|
  |TriviaQA|5-shot|64|58.1|45.2|54.97|-14.11%|-5.39%|+21.62%|
  |Arc-C|10-shot|84.9|90.7|75.9|75.17|-11.46%|-17.12%|-0.96%|
  |Arc-E|10-shot|94.6|97|88.5|88.64|-6.30%|-8.62%|+0.16%|
@@ -172,9 +168,6 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
  ||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|

- \*: We were unable to find an evaluation framework for this benchmark.
-
-
  ## Evaluation Appendix

  In the comparisons presented above, Motif 2.6B showed average performance improvements of -15.36% and -14.78% over Llama 3 8B and Gemma 2 9B, respectively, based on the benchmark scores reported in their original technical reports. However, when compared to the benchmarks and scores reported in the Qwen 2.5 technical report, Motif 2.6B shows an average improvement of +18.55% over Llama 3 8B and +1.12% over Gemma 2 9B. See the table below for details.
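For reference, the percentage columns in these tables are consistent with a simple relative-improvement rule: each entry is Motif 2.6B's score over a baseline's score, expressed as a percentage change, and the **Average** row is the arithmetic mean of a column's percentages. The sketch below is our own illustration (not the authors' evaluation code; the helper name `relative_improvement` is hypothetical) that reproduces two entries from the first table.

```python
# Minimal sketch, assuming the table's percentages are plain relative
# improvements: (motif / baseline - 1) * 100. Not taken from the repository.

def relative_improvement(motif: float, baseline: float) -> float:
    """Percentage change of Motif 2.6B's score relative to a baseline score."""
    return (motif / baseline - 1.0) * 100.0

# Two rows reproduced from the first table (baseline column: Llama 3.2 1B).
scores = {
    "MMLU":  (57.93, 42.3),   # table shows +36.95%
    "GSM8K": (67.85, 15.1),   # table shows +349.34%
}

improvements = {
    name: relative_improvement(motif, base)
    for name, (motif, base) in scores.items()
}
for name, pct in improvements.items():
    print(f"{name}: {pct:+.2f}%")   # MMLU: +36.95%, GSM8K: +349.34%

# The **Average** row is the mean of the column's percentages
# (benchmarks without a score, marked "-", are skipped).
average = sum(improvements.values()) / len(improvements)
print(f"Average: {average:+.2f}%")
```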
 