Update README.md
README.md CHANGED
@@ -166,13 +166,13 @@ marin = AutoModelForCausalLM.from_pretrained("marin-community/marin-8b-base", re
We ran a suite of standard benchmarks to compare our model with [Llama 3.1 8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) and the open-source 7-8B models [OLMo 2 7B](https://huggingface.co/allenai/OLMo-2-1124-7B) and [MAP NEO 7B](https://huggingface.co/m-a-p/neo_7b).

For all benchmarks, we used [LM Eval Harness](https://github.com/EleutherAI/lm-evaluation-harness) with the default setup for each task. (These numbers may differ from reported results due to differences in setup. LM Eval Harness is usually somewhat stricter than other harnesses.)

-
-
-| Marin 8B Base
-| Llama 3.1 Base
-| OLMo 2 Base
-| MAP NEO 7B
-
+| Model | Average | AGI Eval LSAT-AR | ARC Challenge | ARC Easy | BBH | BoolQ | CommonSense QA | COPA | GPQA | GSM8K | HellaSwag (10-shot) | HellaSwag (0-shot) | LAMBADA (OpenAI) | MMLU Pro | MMLU (5-shot) | MMLU (0-shot) | OpenBookQA | PIQA | WinoGrande | WSC |
+|-------|---------|-----------------|---------------|----------|-----|-------|----------------|------|------|-------|---------------------|------------------|---------------|----------|------------|------------|-----------|------|------------|-----|
+| Marin 8B Base <br/>(Deeper Starling) | **66.6** | 20.9 | **63.1** | **86.5** | **50.6** | **85.9** | 79.1 | **92.0** | 30.3 | 61.3 | **83.6** | **82.3** | **74.7** | **36.5** | **67.6** | **65.9** | 44.2 | **84.4** | **74.5** | 82.1 |
+| Llama 3.1 Base | 65.3 | 20.4 | 58.9 | 85.8 | 46.4 | 84.2 | 75.2 | **92.0** | **32.3** | 56.8 | 81.9 | 79.4 | **74.7** | 33.3 | 66.4 | 65.5 | 45.8 | 82.9 | 74.4 | 83.5 |
+| OLMo 2 Base | 64.9 | 17.4 | 60.7 | 85.0 | 44.4 | 85.5 | 75.4 | 89.0 | 26.8 | **67.6** | 81.7 | 80.5 | 73.1 | 30.6 | 63.9 | 61.9 | **46.2** | 82.5 | 74.3 | **86.1** |
+| MAP NEO 7B | 59.5 | **23.0** | 52.0 | 81.1 | 42.4 | 84.7 | **81.7** | 82.0 | 27.8 | 48.0 | 73.3 | 72.5 | 64.6 | 25.2 | 58.2 | 56.4 | 39.4 | 79.0 | 66.1 | 73.3 |
+| Amber 7B | 48.1 | 19.1 | 41.6 | 74.7 | 31.6 | 68.8 | 20.6 | 87.0 | 26.3 | 4.4 | 73.9 | 72.4 | 66.8 | 11.6 | 26.6 | 26.7 | 39.2 | 79.8 | 65.3 | 76.9 |

Marin 8B Base fares well on most of these tasks.
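
For reference, here is a minimal sketch of how a comparable evaluation could be run with LM Eval Harness's Python API. The task names, model arguments, and batch size below are illustrative assumptions rather than the exact configuration behind the table above, and API details may vary across harness versions.

```python
# Hypothetical reproduction sketch using the lm-evaluation-harness Python API.
# Task names and arguments are illustrative; defaults may differ by version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=marin-community/marin-8b-base,dtype=bfloat16",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "boolq", "piqa", "winogrande", "gsm8k"],
    batch_size=8,
)

# Per-task metrics are returned under the "results" key.
for task, metrics in results["results"].items():
    print(task, metrics)
```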