How can I reproduce the eval results?

#2
by bash99 - opened

Should I change the chat template, since Qwen3 is a thinking model by default?

I ran lm_eval with vLLM 0.8.5 and the latest lm-eval version from git.

I used almost the same script as in the model card. (I have two 48 GB 4090s, so I set tensor_parallel_size=2.)

```bash
export CUDA_VISIBLE_DEVICES=0,1
export MODEL=Qwen3-30B-A3B-FP8_dynamic
lm_eval \
  --model vllm \
  --model_args pretrained="$MODEL",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks openllm \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

But the results I got are:

| Tasks                | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|----------------------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| Open LLM Leaderboard | N/A     |                  |        |             |   |        |   |        |
| - arc_challenge      | 1       | none             | 25     | acc         | ↑ | 0.6382 | ± | 0.0140 |
|                      |         | none             | 25     | acc_norm    | ↑ | 0.5623 | ± | 0.0145 |
| - gsm8k              | 3       | flexible-extract | 5      | exact_match | ↑ | 0.2146 | ± | 0.0113 |
|                      |         | strict-match     | 5      | exact_match | ↑ | 0.0061 | ± | 0.0021 |
| - hellaswag          | 1       | none             | 10     | acc         | ↑ | 0.6301 | ± | 0.0048 |
|                      |         | none             | 10     | acc_norm    | ↑ | 0.7173 | ± | 0.0045 |
| - mmlu               | 2       | none             |        | acc         | ↑ | 0.4318 | ± | 0.0041 |
| - truthfulqa_mc2     | 3       | none             | 0      | acc         | ↑ | 0.5571 | ± | 0.0154 |
| - winogrande         | 1       | none             | 5      | acc         | ↑ | 0.7285 | ± | 0.0125 |

Red Hat AI org

The discrepancy is likely due to the thinking mode, which is enabled by default. OpenLLM-style evaluations work significantly better when disabling this behavior.
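For anyone testing this interactively: with a vLLM OpenAI-compatible server, Qwen3's thinking mode can be toggled per request through its chat template. A minimal sketch, assuming a local server on port 8000 and a vLLM version recent enough to support the `chat_template_kwargs` request field:

```bash
# Serve the same model as above (port choice is an assumption).
vllm serve Qwen3-30B-A3B-FP8_dynamic --tensor-parallel-size 2 --port 8000

# Query with thinking disabled via the Qwen3 chat template's
# enable_thinking flag, passed through chat_template_kwargs.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-30B-A3B-FP8_dynamic",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "chat_template_kwargs": {"enable_thinking": false}
      }'
```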

I used this branch from lm-evaluation-harness: https://github.com/neuralmagic/lm-evaluation-harness/tree/enable_thinking, which disables thinking mode by default (although the user can enable it via a vllm argument). I have pushed a PR to the upstream repo, but it hasn't landed yet.
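To try that branch before the PR lands, a standard editable install should work; a sketch below (the extra vLLM argument for re-enabling thinking lives in that branch, so check its diff for the exact name):

```bash
# Install the lm-evaluation-harness fork whose chat-template handling
# disables Qwen3 thinking mode by default.
git clone -b enable_thinking https://github.com/neuralmagic/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e ".[vllm]"
```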

Update: I've tried with `--system_instruction "You are a helpful assistant. /no_think."`

At least for gsm8k_platinum_cot I got 0.8776. For the official FP8 model (https://huggingface.co/Qwen/Qwen3-32B-FP8) I got 0.8983, and for the BF16 version 0.8809. The command I used is sketched below.
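For reference, the full command was along these lines (a sketch: the same model_args as my first command, swapping in the gsm8k_platinum_cot task and the system instruction):

```bash
export CUDA_VISIBLE_DEVICES=0,1
export MODEL=Qwen3-30B-A3B-FP8_dynamic
lm_eval \
  --model vllm \
  --model_args pretrained="$MODEL",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks gsm8k_platinum_cot \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --system_instruction "You are a helpful assistant. /no_think." \
  --batch_size auto
```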

Red Hat AI org

Interesting. Thanks for the update. This level of variability is not uncommon for quantized models.

alexmarques changed discussion status to closed
