base_model:
- LatitudeGames/Muse-12B
vllm (pretrained=/root/autodl-tmp/Muse-12B,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.68 | ± | 0.0296 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.68 | ± | 0.0296 |
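
The Stderr column in these tables appears consistent with the sample standard error of a mean over 0/1 outcomes, with `limit` as the sample size. The sketch below is an assumption about how the figures were produced rather than something stated in the output; `binary_stderr` is a hypothetical helper name.

```python
import math


def binary_stderr(accuracy: float, n: int) -> float:
    """Sample standard error of the mean of n binary (0/1) outcomes.

    Assumes the unbiased sample variance (n - 1 in the denominator),
    which for 0/1 data reduces to accuracy * (1 - accuracy) / (n - 1).
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))


# Values taken from the gsm8k rows for the base model.
print(round(binary_stderr(0.68, 250), 4))   # limit 250 -> 0.0296
print(round(binary_stderr(0.678, 500), 4))  # limit 500 -> 0.0209
```

Doubling `limit` from 250 to 500 shrinks the standard error by roughly a factor of 1.4, which matches the reported change from about 0.030 to about 0.021.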
vllm (pretrained=/root/autodl-tmp/Muse-12B,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.678 | ± | 0.0209 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.676 | ± | 0.0210 |
vllm (pretrained=/root/autodl-tmp/Muse-12B,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
| Groups | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.6713 | ± | 0.0150 |
| - humanities | 2 | none |  | acc | ↑ | 0.7026 | ± | 0.0296 |
| - other | 2 | none |  | acc | ↑ | 0.6923 | ± | 0.0323 |
| - social sciences | 2 | none |  | acc | ↑ | 0.7778 | ± | 0.0294 |
| - stem | 2 | none |  | acc | ↑ | 0.5684 | ± | 0.0279 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-70-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.644 | ± | 0.0303 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.644 | ± | 0.0303 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-86-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.644 | ± | 0.0303 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.644 | ± | 0.0303 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-87-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.692 | ± | 0.0293 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.688 | ± | 0.0294 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-87-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.668 | ± | 0.0211 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.664 | ± | 0.0211 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-87-128-3096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
| Groups | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.6643 | ± | 0.0151 |
| - humanities | 2 | none |  | acc | ↑ | 0.6872 | ± | 0.0303 |
| - other | 2 | none |  | acc | ↑ | 0.6872 | ± | 0.0321 |
| - social sciences | 2 | none |  | acc | ↑ | 0.7667 | ± | 0.0301 |
| - stem | 2 | none |  | acc | ↑ | 0.5684 | ± | 0.0277 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-87-256-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.672 | ± | 0.0298 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.676 | ± | 0.0297 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-87-256-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.686 | ± | 0.0208 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.684 | ± | 0.0208 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-87-256-3096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
| Groups | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.6620 | ± | 0.0149 |
| - humanities | 2 | none |  | acc | ↑ | 0.6821 | ± | 0.0303 |
| - other | 2 | none |  | acc | ↑ | 0.7026 | ± | 0.0311 |
| - social sciences | 2 | none |  | acc | ↑ | 0.7667 | ± | 0.0301 |
| - stem | 2 | none |  | acc | ↑ | 0.5544 | ± | 0.0272 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-875-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.672 | ± | 0.0298 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.672 | ± | 0.0298 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-875-256-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.704 | ± | 0.0289 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.708 | ± | 0.0288 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-875-256-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.690 | ± | 0.0207 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.692 | ± | 0.0207 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-875-256-3096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
| Groups | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.6585 | ± | 0.0150 |
| - humanities | 2 | none |  | acc | ↑ | 0.6974 | ± | 0.0300 |
| - other | 2 | none |  | acc | ↑ | 0.6718 | ± | 0.0327 |
| - social sciences | 2 | none |  | acc | ↑ | 0.7833 | ± | 0.0291 |
| - stem | 2 | none |  | acc | ↑ | 0.5439 | ± | 0.0276 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-876-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.656 | ± | 0.0301 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.656 | ± | 0.0301 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-88-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.644 | ± | 0.0303 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.648 | ± | 0.0303 |
vllm (pretrained=/root/autodl-tmp/Muse-12B-90-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.664 | ± | 0.0299 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.668 | ± | 0.0298 |
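
When reading the gsm8k numbers across checkpoints, it may help to set each score gap against the reported standard errors. The sketch below is a rough check under a stated assumption: it treats the two runs as independent samples, although they were evaluated on the same questions, so a paired comparison on per-question results would be more precise. `z_score` is a hypothetical helper, not part of the evaluation harness.

```python
import math


def z_score(value_a: float, stderr_a: float, value_b: float, stderr_b: float) -> float:
    """Difference of two scores divided by the combined standard error.

    Assumes the two estimates are independent, which is only an
    approximation when both models are scored on the same questions.
    """
    return (value_a - value_b) / math.sqrt(stderr_a**2 + stderr_b**2)


# flexible-extract at limit 250: Muse-12B-875-256-3096 vs. the base Muse-12B.
z = z_score(0.704, 0.0289, 0.68, 0.0296)
print(f"z = {z:.2f}")  # z = 0.58
```

Under that approximation, the largest gsm8k gap over the base model at limit 250 is well under one combined standard error.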