vllm (pretrained=/root/autodl-tmp/AM-Thinking-v1,add_bos_token=true,max_model_len=5096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter           | n-shot | Metric      |   | Value |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.792 | ± | 0.0257 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.780 | ± | 0.0263 |

vllm (pretrained=/root/autodl-tmp/AM-Thinking-v1,add_bos_token=true,max_model_len=3096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter           | n-shot | Metric      |   | Value |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.798 | ± | 0.0180 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.786 | ± | 0.0184 |

vllm (pretrained=/root/autodl-tmp/AM-Thinking-v1,add_bos_token=true,max_model_len=3048,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

| Groups            | Version | Filter | n-shot | Metric |   | Value  |   | Stderr |
|-------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu              |       2 | none   |        | acc    | ↑ | 0.8023 | ± | 0.0131 |
| - humanities      |       2 | none   |        | acc    | ↑ | 0.8154 | ± | 0.0276 |
| - other           |       2 | none   |        | acc    | ↑ | 0.8000 | ± | 0.0276 |
| - social sciences |       2 | none   |        | acc    | ↑ | 0.8556 | ± | 0.0255 |
| - stem            |       2 | none   |        | acc    | ↑ | 0.7614 | ± | 0.0237 |

vllm (pretrained=/root/autodl-tmp/AM-Thinking-v1-awq,add_bos_token=true,max_model_len=5096,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter           | n-shot | Metric      |   | Value |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.820 | ± | 0.0243 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.816 | ± | 0.0246 |

vllm (pretrained=/root/autodl-tmp/AM-Thinking-v1-awq,add_bos_token=true,max_model_len=3096,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter           | n-shot | Metric      |   | Value |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.816 | ± | 0.0173 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.814 | ± | 0.0174 |

vllm (pretrained=/root/autodl-tmp/AM-Thinking-v1-awq,add_bos_token=true,max_model_len=3048,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

| Groups            | Version | Filter | n-shot | Metric |   | Value  |   | Stderr |
|-------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu              |       2 | none   |        | acc    | ↑ | 0.7930 | ± | 0.0132 |
| - humanities      |       2 | none   |        | acc    | ↑ | 0.8051 | ± | 0.0278 |
| - other           |       2 | none   |        | acc    | ↑ | 0.7846 | ± | 0.0277 |
| - social sciences |       2 | none   |        | acc    | ↑ | 0.8444 | ± | 0.0261 |
| - stem            |       2 | none   |        | acc    | ↑ | 0.7579 | ± | 0.0242 |
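The headers above look like output from EleutherAI's lm-evaluation-harness with its vLLM backend. As a minimal sketch (assuming `lm_eval` and `vllm` are installed, and using the local model path and settings from the first GSM8K header above), an equivalent invocation might be:

```shell
# Sketch of reproducing the first GSM8K run above with lm-evaluation-harness.
# Assumes: pip install "lm_eval[vllm]" and a local copy of the model weights.
lm_eval --model vllm \
  --model_args pretrained=/root/autodl-tmp/AM-Thinking-v1,add_bos_token=true,max_model_len=5096,dtype=bfloat16 \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 250 \
  --batch_size auto
```

The MMLU runs above would swap `--tasks mmlu` with the corresponding `--limit` and `--batch_size` values from their headers.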
Model: noneUsername/AM-Thinking-v1-awq (AWQ quantization of AM-Thinking-v1)