MT-Bench scores are unexpectedly low for EXAONE-3.5-7.8B-Instruct.

#2
by yhg0112 - opened

Congratulations on releasing Trillion-7B-preview. It's great to see more open models at this scale.

I'm Hyeongu Yun @ LG AI Research / EXAONE Lab, and I've been reviewing your model's performance on various tasks.

I found that the MT-Bench score you report for LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct is significantly lower than the results we have obtained for it.

We immediately re-evaluated MT-Bench for EXAONE-3.5-7.8B-Instruct and LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct five times each with GPT-4o-2024-08-06 as the judge.

Here are our results:

EXAONE-3.5-7.8B-Instruct

gpt-4o-2024-08-06 MT-Bench Results

| Run | Average | Writing | Roleplay | Reasoning | Math | Coding | Extraction | STEM | Humanities |
|-----|---------|---------|----------|-----------|------|--------|------------|------|------------|
| run_1 | 8.35 | 9.00 | 8.35 | 6.40 | 7.65 | 8.30 | 8.80 | 9.25 | 9.05 |
| run_2 | 8.34 | 8.85 | 8.35 | 6.20 | 8.00 | 8.10 | 8.85 | 9.25 | 9.10 |
| run_3 | 8.38 | 8.90 | 8.50 | 6.35 | 8.15 | 8.10 | 8.70 | 9.20 | 9.15 |
| run_4 | 8.36 | 8.90 | 8.50 | 6.40 | 8.00 | 8.10 | 8.75 | 9.20 | 9.05 |
| run_5 | 8.39 | 8.95 | 8.50 | 6.25 | 8.30 | 8.10 | 8.75 | 9.20 | 9.10 |

Total Average: 8.37
Std: 0.02

The 5-run average of 8.37 is far from your reported score of 6.75.

For your reference, here are today's re-evaluated MT-Bench scores for EXAONE-3.5-2.4B-Instruct:

EXAONE-3.5-2.4B-Instruct

gpt-4o-2024-08-06 MT-Bench Results

| Run | Average | Writing | Roleplay | Reasoning | Math | Coding | Extraction | STEM | Humanities |
|-----|---------|---------|----------|-----------|------|--------|------------|------|------------|
| run_1 | 7.73 | 8.40 | 8.30 | 5.65 | 7.60 | 6.60 | 7.70 | 8.55 | 9.05 |
| run_2 | 7.84 | 8.40 | 8.25 | 5.75 | 7.60 | 7.30 | 7.85 | 8.65 | 8.95 |
| run_3 | 7.80 | 8.45 | 8.30 | 5.75 | 7.50 | 7.15 | 7.55 | 8.75 | 8.95 |
| run_4 | 7.86 | 8.50 | 8.30 | 5.70 | 7.75 | 7.05 | 7.80 | 8.75 | 9.05 |
| run_5 | 7.83 | 8.55 | 8.20 | 6.20 | 7.50 | 6.90 | 7.55 | 8.75 | 8.95 |

Total Average: 7.81
Std: 0.05
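
As a quick sanity check on the aggregation, here is a minimal Python sketch (not the exact evaluation script) that reproduces the total average and standard deviation from the run-level averages in the 2.4B table above; at this precision it makes no difference whether the std is population or sample.

```python
import statistics

# Run-level MT-Bench averages for EXAONE-3.5-2.4B-Instruct (from the table above).
run_averages = [7.73, 7.84, 7.80, 7.86, 7.83]

mean = statistics.mean(run_averages)    # ~7.81
pstd = statistics.pstdev(run_averages)  # population std, ~0.05
sstd = statistics.stdev(run_averages)   # sample std, also ~0.05 after rounding

print(f"Total Average: {mean:.2f}")
print(f"Std: {pstd:.2f} (population) / {sstd:.2f} (sample)")
```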

We fully understand that scores can fluctuate with prompt settings, generation configs, or even inference hardware; however, this difference in MT-Bench score looks too large.

Could you look into this issue a bit more deeply? My first guess is that, if you used "https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge", you may have missed the correct chat template for EXAONE-3.5-7.8B-Instruct, including the system message.

It would also help to investigate this issue if you could post a few examples of the model's inputs and outputs.
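
To make the suspected failure mode concrete, here is a minimal sketch (assuming the standard `transformers` `apply_chat_template` API; the system-message text and user turn are only illustrative) of how an MT-Bench-style turn should be rendered with the model's own chat template. If the judging pipeline builds prompts from a generic conversation template instead, the rendered prompt will not match what this snippet produces.

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with the model so that its own chat template is used.
model_id = "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# An MT-Bench-style first turn; the system message here is only an illustrative placeholder.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Compose an engaging travel blog post about a recent trip to Hawaii."},
]

# Render the prompt exactly as the model expects, with the generation prompt appended.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # inspect the rendered prompt to confirm the template and system message are applied
```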

Trillion Labs org

Hi, we’re actively looking into this, and it seems there was an overall “chat-template” problem across our LLM-judge environment, where it wasn’t applied appropriately. We’ll leave a comment when it’s updated. Thank you for notifying us.

Trillion Labs org

Hi @yhg0112 , we have updated the scores accordingly.

juyoung-trl changed discussion status to closed

Thank you!
