Update README.md
README.md (changed)
@@ -36,12 +36,13 @@ The benchmarks and metrics used are identical to those in the [Mistral 7B techni
|MBPP|3-shot|47.5|60.3|+26.95%|
|MATH|4-shot, maj@4|13.1|39.2*|+199.24%|
|GSM8K|8-shot, maj@8|52.2|77.71|+48.87%|
-||||**Average
+||||**Average**|**+33.01%**|

\* : We report the 4-shot score instead of the 4-shot, maj@4.

### Comparison to Llama

+#### Llama 3
The benchmarks and metrics used are identical to those in the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).

|Benchmark|Metric|Llama 3 8B|Motif 2.6B|Improvement|
@@ -56,4 +57,22 @@ The benchmarks and metrics used are identical to those in the [Llama 3 technical
|MATH|0-shot, CoT|51.9|49.68|-4.28%|
|ARC Challenge|0-shot|83.4|74.2|-11.03%|
|GPQA|0-shot, CoT|32.8|18.53|-43.51%|
-||||**Average
+||||**Average**|**-15.68%**|
+
+#### Llama 3.2
+The benchmarks and metrics used are identical to those in the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
+
+|Benchmark|Metric|Llama 3.2 1B|Llama 3.2 3B|Motif 2.6B|Improvement (over 1B)|Improvement (over 3B)|
+|---|---|---|---|---|---|---|
+|MMLU|0-shot|49.3|63.4|57.6|+16.75%|-9.21%|
+|Open-rewrite eval*|0-shot, rougeL|41.6|40.1|-|-|-|
+|TLDR9+*|test, 1-shot, rougeL|16.8|19|-|-|-|
+|IFEval|-|59.5|77.4|74.02|+24.40%|-4.37%|
+|GSM8K|8-shot, CoT|44.4|77.7|74.9|+68.69%|-3.60%|
+|MATH|0-shot, CoT|30.6|48|49.68|+62.35%|+3.50%|
+|ARC Challenge|0-shot|59.4|78.5|74.2|+24.92%|-5.48%|
+|GPQA|0-shot|27.2|32.8|25.45|-6.43%|-22.41%|
+|Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
+|||||**Average**|**+39.42%**|**-3.83%**|
+
+\* We were unable to find an evaluation framework for this benchmark.
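A note on the derived columns: the Improvement values in the tables above appear to be the relative change of the Motif 2.6B score against the listed baseline, i.e. (Motif - baseline) / baseline * 100. The sketch below reproduces a few of the printed values under that assumption; the `improvement` helper is illustrative and not part of this repository, the published figures may be computed from unrounded scores, and how the **Average** rows are aggregated is not shown in this diff.

```python
def improvement(baseline: float, motif: float) -> float:
    """Relative change of a Motif 2.6B score over a baseline score, in percent."""
    return (motif - baseline) / baseline * 100

# Values copied from the tables above.
print(f"MBPP  vs Mistral 7B  : {improvement(47.5, 60.3):+.2f}%")   # +26.95%, matches the table
print(f"GSM8K vs Mistral 7B  : {improvement(52.2, 77.71):+.2f}%")  # +48.87%, matches the table
print(f"MATH  vs Llama 3 8B  : {improvement(51.9, 49.68):+.2f}%")  # -4.28%, matches the table
print(f"MMLU  vs Llama 3.2 3B: {improvement(63.4, 57.6):+.2f}%")   # prints -9.15%; table shows -9.21%, likely from unrounded scores
```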