JH-Motif committed on
Commit 8750422 · verified · 1 Parent(s): 891d642

Update README.md

Files changed (1)
  1. README.md +21 -2
README.md CHANGED
@@ -36,12 +36,13 @@ The benchmarks and metrics used are identical to those in the [Mistral 7B techni
|MBPP|3-shot|47.5|60.3|+26.95%|
|MATH|4-shot, maj@4|13.1|39.2*|+199.24%|
|GSM8K|8-shot, maj@8|52.2|77.71|+48.87%|
- ||||**Average**|+33.01%|
+ ||||**Average**|**+33.01%**|

\* : We report the 4-shot score instead of the 4-shot, maj@4.

### Comparison to Llama

+ #### Llama 3
The benchmarks and metrics used are identical to those in the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).

|Benchmark|Metric|Llama 3 8B|Motif 2.6B|Improvement|
@@ -56,4 +57,22 @@ The benchmarks and metrics used are identical to those in the [Llama 3 technical
|MATH|0-shot, CoT|51.9|49.68|-4.28%|
|ARC Challenge|0-shot|83.4|74.2|-11.03%|
|GPQA|0-shot, CoT|32.8|18.53|-43.51%|
- ||||**Average**|-15.68%|
+ ||||**Average**|**-15.68%**|
+
+ #### Llama 3.2
+ The benchmarks and metrics used are identical to those in the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
+
+ |Benchmark|Metric|Llama 3.2 1B|Llama 3.2 3B|Motif 2.6B|Improvement (over 1B)|Improvement (over 3B)|
+ |---|---|---|---|---|---|---|
+ |MMLU|0-shot|49.3|63.4|57.6|+16.75%|-9.21%|
+ |Open-rewrite eval*|0-shot, rougeL|41.6|40.1|-|-|-|
+ |TLDR9+*|test, 1-shot, rougeL|16.8|19|-|-|-|
+ |IFEval|-|59.5|77.4|74.02|+24.40%|-4.37%|
+ |GSM8K|8-shot, CoT|44.4|77.7|74.9|+68.69%|-3.60%|
+ |MATH|0-shot, CoT|30.6|48|49.68|+62.35%|+3.50%|
+ |ARC Challenge|0-shot|59.4|78.5|74.2|+24.92%|-5.48%|
+ |GPQA|0-shot|27.2|32.8|25.45|-6.43%|-22.41%|
+ |Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
+ |||||**Average**|**+39.42%**|**-3.83%**|
+
+ \* We were unable to find an evaluation framework for this benchmark.
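
The Improvement columns in these tables are consistent with a simple relative difference against the baseline score, and the **Average** rows look like the arithmetic mean of the per-benchmark percentages. Below is a minimal sketch of that calculation; the `relative_improvement` helper and the hard-coded rows are illustrative only, not code from this repository.

```python
def relative_improvement(baseline: float, motif: float) -> float:
    """Percent change of the Motif score relative to the baseline score."""
    return (motif - baseline) / baseline * 100

# Rows from the Mistral 7B comparison visible in this hunk: (baseline, Motif 2.6B)
rows = {
    "MBPP": (47.5, 60.3),
    "MATH": (13.1, 39.2),
    "GSM8K": (52.2, 77.71),
}

improvements = {name: relative_improvement(b, m) for name, (b, m) in rows.items()}
for name, imp in improvements.items():
    print(f"{name}: {imp:+.2f}%")  # MBPP: +26.95%, MATH: +199.24%, GSM8K: +48.87%

# The table's **Average** row appears to be the mean of the per-benchmark
# improvements over the full table (which has more rows than this hunk shows).
print(f"Average over these rows: {sum(improvements.values()) / len(improvements):+.2f}%")
```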