Update README.md
README.md CHANGED
@@ -166,13 +166,13 @@ marin = AutoModelForCausalLM.from_pretrained("marin-community/marin-8b-base", re
We ran a suite of standard benchmarks to compare our model with [Llama 3.1 8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) and the open-source 7-8B models [OLMo 2 7B](https://huggingface.co/allenai/OLMo-2-1124-7B) and [MAP NEO 7B](https://huggingface.co/m-a-p/neo_7b).

For all benchmarks, we used [LM Eval Harness](https://github.com/EleutherAI/lm-evaluation-harness) with the default setup for each task. (These numbers may differ from reported results due to differences in setup. LM Eval Harness is usually somewhat stricter than other harnesses.)

-
-
-| Marin 8B Base
-| Llama 3.1 Base
-| OLMo 2 Base
-| MAP NEO 7B
-
+| Model | Average | AGI Eval LSAT-AR | ARC Challenge | ARC Easy | BBH | BoolQ | CommonSense QA | COPA | GPQA | GSM8K | HellaSwag (10-shot) | HellaSwag (0-shot) | LAMBADA (OpenAI) | MMLU Pro | MMLU (5-shot) | MMLU (0-shot) | OpenBookQA | PIQA | WinoGrande | WSC |
+|-------|---------|-----------------|---------------|----------|-----|-------|----------------|------|------|-------|---------------------|------------------|---------------|----------|------------|------------|-----------|------|------------|-----|
+| Marin 8B Base <br/>(Deeper Starling) | **66.6** | 20.9 | **63.1** | **86.5** | **50.6** | **85.9** | 79.1 | **92.0** | 30.3 | 61.3 | **83.6** | **82.3** | **74.7** | **36.5** | **67.6** | **65.9** | 44.2 | **84.4** | **74.5** | 82.1 |
+| Llama 3.1 Base | 65.3 | 20.4 | 58.9 | 85.8 | 46.4 | 84.2 | 75.2 | **92.0** | **32.3** | 56.8 | 81.9 | 79.4 | **74.7** | 33.3 | 66.4 | 65.5 | 45.8 | 82.9 | 74.4 | 83.5 |
+| OLMo 2 Base | 64.9 | 17.4 | 60.7 | 85.0 | 44.4 | 85.5 | 75.4 | 89.0 | 26.8 | **67.6** | 81.7 | 80.5 | 73.1 | 30.6 | 63.9 | 61.9 | **46.2** | 82.5 | 74.3 | **86.1** |
+| MAP NEO 7B | 59.5 | **23.0** | 52.0 | 81.1 | 42.4 | 84.7 | **81.7** | 82.0 | 27.8 | 48.0 | 73.3 | 72.5 | 64.6 | 25.2 | 58.2 | 56.4 | 39.4 | 79.0 | 66.1 | 73.3 |
+| Amber 7B | 48.1 | 19.1 | 41.6 | 74.7 | 31.6 | 68.8 | 20.6 | 87.0 | 26.3 | 4.4 | 73.9 | 72.4 | 66.8 | 11.6 | 26.6 | 26.7 | 39.2 | 79.8 | 65.3 | 76.9 |

Marin 8B Base fares well on most of these tasks.
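
For reference, here is a minimal sketch of how a comparable evaluation could be run with LM Eval Harness's Python API. The task names, model arguments, and batch size below are illustrative assumptions rather than the exact configuration behind the table above, and API details may vary across harness versions.

```python
# Hypothetical reproduction sketch using the lm-evaluation-harness Python API.
# Task names and arguments are illustrative; defaults may differ by version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=marin-community/marin-8b-base,dtype=bfloat16",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "boolq", "piqa", "winogrande", "gsm8k"],
    batch_size=8,
)

# Per-task metrics are returned under the "results" key.
for task, metrics in results["results"].items():
    print(task, metrics)
```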