MT-Bench scores are unexpectedly low for EXAONE-3.5-7.8B-Instruct.
Congratulations on releasing Trillion-7B-preview. It's great to see more open models at this scale.
I'm Hyeongu Yun @ LG AI Research / EXAONE lab, and I've been reviewing your model's performance on various tasks.
I found that your reported MT-Bench scores are significantly lower than the results we obtained for LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct.
We immediately re-evaluated MT-Bench for EXAONE-3.5-7.8B-Instruct and LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct five times each with GPT-4o-2024-08-06 as the judge.
Here are our results:
EXAONE-3.5-7.8B-Instruct
gpt-4o-2024-08-06 MT-Bench Results
Run | Average | Writing | Roleplay | Reasoning | Math | Coding | Extraction | STEM | Humanities |
---|---|---|---|---|---|---|---|---|---|
run_1 | 8.35 | 9.00 | 8.35 | 6.40 | 7.65 | 8.30 | 8.80 | 9.25 | 9.05 |
run_2 | 8.34 | 8.85 | 8.35 | 6.20 | 8.00 | 8.10 | 8.85 | 9.25 | 9.10 |
run_3 | 8.38 | 8.90 | 8.50 | 6.35 | 8.15 | 8.10 | 8.70 | 9.20 | 9.15 |
run_4 | 8.36 | 8.90 | 8.50 | 6.40 | 8.00 | 8.10 | 8.75 | 9.20 | 9.05 |
run_5 | 8.39 | 8.95 | 8.50 | 6.25 | 8.30 | 8.10 | 8.75 | 9.20 | 9.10 |
Total Average: 8.37
Std: 0.02
The 5-run average of 8.37 is far from your reported score of 6.75.
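(For transparency on how the spread was computed, here is a minimal Python sketch recomputing the statistics from the rounded per-run averages in the table above; the mean comes out as 8.36 rather than 8.37 only because the inputs are already rounded to two decimals, and the std matches.)

```python
# Sanity check on the 5-run statistics, using the rounded per-run
# averages from the table above (population std, to two decimals).
import statistics

run_averages = [8.35, 8.34, 8.38, 8.36, 8.39]

mean = statistics.fmean(run_averages)   # 8.364 -> 8.36 from rounded inputs
std = statistics.pstdev(run_averages)   # ~0.019 -> 0.02

print(f"mean={mean:.2f}, std={std:.2f}")
```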
For your reference, here are today's re-evaluated MT-Bench scores for EXAONE-3.5-2.4B-Instruct:
EXAONE-3.5-2.4B-Instruct
gpt-4o-2024-08-06 MT-Bench Results
Run | Average | Writing | Roleplay | Reasoning | Math | Coding | Extraction | STEM | Humanities |
---|---|---|---|---|---|---|---|---|---|
run_1 | 7.73 | 8.40 | 8.30 | 5.65 | 7.60 | 6.60 | 7.70 | 8.55 | 9.05 |
run_2 | 7.84 | 8.40 | 8.25 | 5.75 | 7.60 | 7.30 | 7.85 | 8.65 | 8.95 |
run_3 | 7.80 | 8.45 | 8.30 | 5.75 | 7.50 | 7.15 | 7.55 | 8.75 | 8.95 |
run_4 | 7.86 | 8.50 | 8.30 | 5.70 | 7.75 | 7.05 | 7.80 | 8.75 | 9.05 |
run_5 | 7.83 | 8.55 | 8.20 | 6.20 | 7.50 | 6.90 | 7.55 | 8.75 | 8.95 |
Total Average: 7.81
Std: 0.05
We fully understand that scores can fluctuate with prompt settings, generation configs, or even inference hardware; however, this difference in MT-Bench scores looks too large.
Could you look into this issue a bit more deeply? My first guess is that, if you used "https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge", you may have missed the correct chat template for EXAONE-3.5-7.8B-Instruct, including system messages.
It would also be much easier to investigate if you could post a few input-output examples from the model.
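For concreteness, here is a minimal sketch of what I mean by the correct chat template: EXAONE-3.5 ships its template with the tokenizer, so MT-Bench answers should be generated via `apply_chat_template` (with the system message included) rather than through a generic FastChat conversation template. The system prompt and question below are illustrative, not the exact ones from our runs:

```python
# Minimal sketch: generate an MT-Bench answer using the model's own chat
# template (including the system message) via apply_chat_template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # EXAONE uses a custom model class
    device_map="auto",
)

# Illustrative system prompt and MT-Bench-style question.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Compose an engaging travel blog post "
                                "about a recent trip to Hawaii."},
]

# apply_chat_template picks up the template stored with the tokenizer,
# which is exactly what a generic conversation template can get wrong.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

If the answer-generation side of the pipeline builds prompts with its own template instead, the model never sees its expected special tokens or system message, which would be consistent with a gap of this size.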
Hi, we’re actively looking into this, and it seems there was an overall “chat-template” problem across our LLM-judge environment, where the template wasn’t applied appropriately. We’ll leave a comment when it’s updated. Thank you for notifying us.
Thank you!