Update README.md
README.md CHANGED
@@ -61,7 +61,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 |MMLU|5-shot|42.3|64.4|52.2|71.3|57.93|+36.95%|-10.05%|+10.98%|-18.75%|
 |ARC-C|25-shot|48.5|61.1|55.7|68.4|75.08|+54.80%|+22.88%|+34.79%|+9.77%|
 |GSM8K|5-shot|15.1|51.8|24.3|68.6|67.85|+349.34%|+30.98%|+179.22%|-1.09%|
-|AGIEval
+|AGIEval|3-5-shot|24.2|44.9|31.5|52.8|-|-|-|-|-|
 |DROP|3-shot, F1|48.5|56.3|51.2|69.4|29.33|-39.53%|-47.90%|-42.71%|-57.74%|
 |BBH|3-shot, CoT|35.2|59|41.9|68.2|48.56|+37.95%|-17.69%|+15.89%|-28.80%|
 |Winogrande|5-shot|66.8|79|71.3|80.6|67.09|+0.43%|-15.08%|-5.90%|-16.76%|
@@ -77,8 +77,6 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 |MBPP|3-shot|29.2|44.4|30.2|52.4|60.3|+106.51%|+35.81%|+99.67%|+15.08%|
 |||||||**Average**|**+84.76%**|**+1.69%**|**+42.42%**|**-14.78%**|

-\*: We were unable to find an evaluation framework for this benchmark.
-
 #### Gemma 3
 The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).

@@ -132,7 +130,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 |---|---|---|---|---|---|---|
 |MMLU|0-shot|49.3|63.4|57.6|+16.75%|-9.21%|
 |Open-rewrite eval*|0-shot, rougeL|41.6|40.1|-|-|-|
-|TLDR9
+|TLDR9+|test, 1-shot, rougeL|16.8|19|-|-|-|
 |IFEval|-|59.5|77.4|74.02|+24.40%|-4.37%|
 |GSM8K|8-shot, CoT|44.4|77.7|74.9|+68.69%|-3.60%|
 |MATH|0-shot, CoT|30.6|48|49.68|+62.35%|+3.50%|
@@ -141,8 +139,6 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 |Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
 |||||**Average**|**+39.42%**|**-3.86%**|

-\*: We were unable to find an evaluation framework for this benchmark.
-
 ### Comparison to the Phi series by Microsoft
 The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).

@@ -154,7 +150,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 |GSM-8K|8-shot, CoT|82.5|89.6|61.1|76.5|-7.27%|-14.62%|+25.20%|
 |MATH|0-shot, CoT|41.3|34.6|-|49.68|+20.29%|+43.58%|-|
 |MedQA|2-shot|53.8|65.4|40.9|42.1|-21.75%|-35.63%|+2.93%|
-|AGIEval
+|AGIEval|0-shot|37.5|45.1|29.8|-|-|-|-|
 |TriviaQA|5-shot|64|58.1|45.2|54.97|-14.11%|-5.39%|+21.62%|
 |Arc-C|10-shot|84.9|90.7|75.9|75.17|-11.46%|-17.12%|-0.96%|
 |Arc-E|10-shot|94.6|97|88.5|88.64|-6.30%|-8.62%|+0.16%|
@@ -172,9 +168,6 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 |MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
 ||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|

-\*: We were unable to find an evaluation framework for this benchmark.
-
-
 ## Evaluation Appendix

 In the comparisons presented above, Motif 2.6B showed average performance differences of -15.36% and -14.78% relative to Llama 3 8B and Gemma 2 9B, respectively, based on the benchmark scores reported in their original technical reports. However, when the comparison is made using the benchmarks and scores reported in the Qwen 2.5 technical report, Motif 2.6B shows an average improvement of +18.55% over Llama 3 8B and +1.12% over Gemma 2 9B. See the table below for details.
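The percentage columns in the tables above are plain relative differences, (Motif score − baseline score) / baseline score × 100, and the **Average** rows appear to be arithmetic means of those per-benchmark percentages. Below is a minimal Python sketch of that computation; the values are copied from the first table, and the function name is illustrative rather than taken from this repository's evaluation code.

```python
# Minimal sketch: reproducing the per-benchmark Δ columns and the Average
# row from the tables above. Scores are copied from the MMLU / ARC-C /
# GSM8K rows (baseline = first score column, Motif 2.6B = fifth).

def relative_improvement(motif: float, baseline: float) -> float:
    """Percentage change of the Motif 2.6B score relative to a baseline."""
    return (motif - baseline) / baseline * 100

rows = [             # (benchmark, baseline score, Motif 2.6B score)
    ("MMLU",  42.3, 57.93),
    ("ARC-C", 48.5, 75.08),
    ("GSM8K", 15.1, 67.85),
]

deltas = {name: relative_improvement(motif, base) for name, base, motif in rows}
for name, delta in deltas.items():
    # Prints: MMLU: +36.95%, ARC-C: +54.80%, GSM8K: +349.34%,
    # matching the first Δ column of the table.
    print(f"{name}: {delta:+.2f}%")

# The Average row is assumed here to be the arithmetic mean of the
# per-benchmark percentages in each column.
average = sum(deltas.values()) / len(deltas)
print(f"Average: {average:+.2f}%")
```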