Details of Ability Loss

#1
by noneUsername - opened

Original model:

vllm (pretrained=/root/autodl-tmp/Devstral-Small-2505,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | 0.864 | ± 0.0217 |
| | | strict-match | 5 | exact_match ↑ | 0.860 | ± 0.0220 |

vllm (pretrained=/root/autodl-tmp/Devstral-Small-2505,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | 0.868 | ± 0.0152 |
| | | strict-match | 5 | exact_match ↑ | 0.864 | ± 0.0153 |

vllm (pretrained=/root/autodl-tmp/Devstral-Small-2505,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc ↑ | 0.7965 | ± 0.0129 |
| - humanities | 2 | none | | acc ↑ | 0.8205 | ± 0.0244 |
| - other | 2 | none | | acc ↑ | 0.8308 | ± 0.0259 |
| - social sciences | 2 | none | | acc ↑ | 0.8444 | ± 0.0261 |
| - stem | 2 | none | | acc ↑ | 0.7263 | ± 0.0252 |

Final W8A8 quantized model:

vllm (pretrained=/root/autodl-tmp/87-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | 0.860 | ± 0.0220 |
| | | strict-match | 5 | exact_match ↑ | 0.856 | ± 0.0222 |

vllm (pretrained=/root/autodl-tmp/87-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | 0.85 | ± 0.0160 |
| | | strict-match | 5 | exact_match ↑ | 0.84 | ± 0.0164 |

vllm (pretrained=/root/autodl-tmp/87-128-3096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc ↑ | 0.7509 | ± 0.0139 |
| - humanities | 2 | none | | acc ↑ | 0.7949 | ± 0.0261 |
| - other | 2 | none | | acc ↑ | 0.7641 | ± 0.0287 |
| - social sciences | 2 | none | | acc ↑ | 0.8167 | ± 0.0285 |
| - stem | 2 | none | | acc ↑ | 0.6702 | ± 0.0268 |

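Original -> W8A8 (relative drop in parentheses):

gsm8k strict-match, limit 250: 0.860->0.856: ↓0.004 (0.47%)
gsm8k strict-match, limit 500: 0.864->0.84: ↓0.024 (2.8%)
mmlu acc, limit 15: 0.7965->0.7509: ↓0.0456 (5.73%)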
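
For anyone who wants to recheck the arithmetic, here is a minimal sketch of how the absolute and relative drops can be recomputed from the scores reported above. The helper name `drop` and the dictionary labels are my own for illustration and are not part of lm-evaluation-harness; the relative drop is assumed to mean the absolute drop divided by the original model's score.

```python
def drop(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop, relative drop in percent) going from before to after."""
    absolute = before - after
    relative_pct = absolute / before * 100
    return absolute, relative_pct


# Scores copied from the tables in this post: (original model, W8A8 model).
results = {
    "gsm8k strict-match, limit 250": (0.860, 0.856),
    "gsm8k strict-match, limit 500": (0.864, 0.84),
    "mmlu acc, limit 15": (0.7965, 0.7509),
}

for name, (before, after) in results.items():
    absolute, relative_pct = drop(before, after)
    print(f"{name}: {before} -> {after}: down {absolute:.4f} ({relative_pct:.2f}%)")
```

Note that this only compares point estimates; the Stderr columns in the tables are not taken into account here.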
