Update README.md
README.md (changed)
@@ -36,12 +36,13 @@ The benchmarks and metrics used are identical to those in the [Mistral 7B techni
|MBPP|3-shot|47.5|60.3|+26.95%|
|MATH|4-shot, maj@4|13.1|39.2*|+199.24%|
|GSM8K|8-shot, maj@8|52.2|77.71|+48.87%|
-||||**Average
+||||**Average**|**+33.01%**|

\* : We report the 4-shot score instead of the 4-shot, maj@4.

### Comparison to Llama

+#### Llama 3
The benchmarks and metrics used are identical to those in the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).

|Benchmark|Metric|Llama 3 8B|Motif 2.6B|Improvement|
@@ -56,4 +57,22 @@ The benchmarks and metrics used are identical to those in the [Llama 3 technical
|MATH|0-shot, CoT|51.9|49.68|-4.28%|
|ARC Challenge|0-shot|83.4|74.2|-11.03%|
|GPQA|0-shot, CoT|32.8|18.53|-43.51%|
-||||**Average
+||||**Average**|**-15.68%**|
+
+#### Llama 3.2
+The benchmarks and metrics used are identical to those in the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
+
+|Benchmark|Metric|Llama 3.2 1B|Llama 3.2 3B|Motif 2.6B|Improvement (over 1B)|Improvement (over 3B)|
+|---|---|---|---|---|---|---|
+|MMLU|0-shot|49.3|63.4|57.6|+16.75%|-9.21%|
+|Open-rewrite eval*|0-shot, rougeL|41.6|40.1|-|-|-|
+|TLDR9+*|test, 1-shot, rougeL|16.8|19|-|-|-|
+|IFEval|-|59.5|77.4|74.02|+24.40%|-4.37%|
+|GSM8K|8-shot, CoT|44.4|77.7|74.9|+68.69%|-3.60%|
+|MATH|0-shot, CoT|30.6|48|49.68|+62.35%|+3.50%|
+|ARC Challenge|0-shot|59.4|78.5|74.2|+24.92%|-5.48%|
+|GPQA|0-shot|27.2|32.8|25.45|-6.43%|-22.41%|
+|Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
+|||||**Average**|**+39.42%**|**-3.83%**|
+
+\* We were unable to find an evaluation framework for this benchmark.
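A note on the derived columns: the Improvement values in the tables above appear to be the relative change of the Motif 2.6B score against the listed baseline, i.e. (Motif - baseline) / baseline * 100. The sketch below reproduces a few of the printed values under that assumption; the `improvement` helper is illustrative and not part of this repository, the published figures may be computed from unrounded scores, and how the **Average** rows are aggregated is not shown in this diff.

```python
def improvement(baseline: float, motif: float) -> float:
    """Relative change of a Motif 2.6B score over a baseline score, in percent."""
    return (motif - baseline) / baseline * 100

# Values copied from the tables above.
print(f"MBPP  vs Mistral 7B  : {improvement(47.5, 60.3):+.2f}%")   # +26.95%, matches the table
print(f"GSM8K vs Mistral 7B  : {improvement(52.2, 77.71):+.2f}%")  # +48.87%, matches the table
print(f"MATH  vs Llama 3 8B  : {improvement(51.9, 49.68):+.2f}%")  # -4.28%, matches the table
print(f"MMLU  vs Llama 3.2 3B: {improvement(63.4, 57.6):+.2f}%")   # prints -9.15%; table shows -9.21%, likely from unrounded scores
```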