Update README.md
README.md
CHANGED
@@ -113,6 +113,27 @@ The benchmarks and metrics used are identical to those in the [Phi-3 technical r
|
|
113 |
#### Gemma 1 & 2
|
114 |
The benchmarks and metrics used are identical to those in the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
|
115 |
|
116 |
#### Gemma 3
|
117 |
The benchmarks and metrics used are identical to those in the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
|
118 |
|
@@ -127,6 +148,6 @@ The benchmarks and metrics used are identical to those in the [Gemma 3 technical
|
|
127 |
|MATH|4-shot|48|75.6|40.2|-16.25%|-46.83%|
|
128 |
|HiddenMath*|-|15.8|43|-|-|-|
|
129 |
|MMLU(val)|5-shot|-|48.8|57.93|-|+18.71%|
|
130 |
-
|||||**Average
|
131 |
|
132 |
\*: We were unable to find an evaluation framework for this benchmark.
|
|
|
113 |
#### Gemma 1 & 2
|
114 |
The benchmarks and metrics used are identical to those in the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
|
115 |
|
116 |
+
|Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
|
117 |
+
|---|---|---|---|---|---|---|---|---|---|---|
|
118 |
+
|MMLU|5-shot||||||||||
|
119 |
+
|ARC-C|25-shot||||||||||
|
120 |
+
|GSM8K|5-shot||||||||||
|
121 |
+
|AGIEval*|3-5-shot||||||||||
|
122 |
+
|DROP|3-shot, F1||||||||||
|
123 |
+
|BBH|3-shot, CoT||||||||||
|
124 |
+
|Winogrande|5-shot||||||||||
|
125 |
+
|HellaSwag|10-shot||||||||||
|
126 |
+
|MATH|4-shot||||||||||
|
127 |
+
|ARC-e|0-shot||||||||||
|
128 |
+
|PIQA|0-shot||||||||||
|
129 |
+
|SIQA|0-shot||||||||||
|
130 |
+
|Boolq|0-shot||||||||||
|
131 |
+
|TriviaQA|5-shot||||||||||
|
132 |
+
|NQ|5-shot||||||||||
|
133 |
+
|HumanEval|pass@1||||||||||
|
134 |
+
|MBPP|3-shot||||||||||
|
135 |
+
|||||||**Average**|**TBA**|**TBA**|**TBA**|**TBA**|
|
136 |
+
|
137 |
#### Gemma 3
|
138 |
The benchmarks and metrics used are identical to those in the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
|
139 |
|
|
|
148 |
|MATH|4-shot|48|75.6|40.2|-16.25%|-46.83%|
|
149 |
|HiddenMath*|-|15.8|43|-|-|-|
|
150 |
|MMLU(val)|5-shot|-|48.8|57.93|-|+18.71%|
|
151 |
+
|||||**Average**|**+24.71%**|**-8.28%**|
|
152 |
|
153 |
\*: We were unable to find an evaluation framework for this benchmark.
|
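The improvement columns in the rows above appear to be relative changes of the last score column against each baseline score. The README does not state the formula, so the sketch below is an assumption that happens to reproduce the MATH and MMLU(val) figures shown in this diff. The helper name `relative_improvement` is hypothetical and not part of the repository.

```python
def relative_improvement(baseline: float, candidate: float) -> str:
    """Percentage change of `candidate` relative to `baseline`, formatted like '+18.71%'.

    Assumed formula (not confirmed by the README): (candidate - baseline) / baseline * 100.
    """
    change = (candidate - baseline) / baseline * 100.0
    return f"{change:+.2f}%"


# Scores copied from the MATH and MMLU(val) rows shown in this diff.
math_vs_first = relative_improvement(48.0, 40.2)
math_vs_second = relative_improvement(75.6, 40.2)
mmlu_vs_second = relative_improvement(48.8, 57.93)

print(math_vs_first, math_vs_second, mmlu_vs_second)
```

Rows where a baseline score is missing (shown as `-`) have no defined relative change, which is consistent with the `-` entries in the improvement columns.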