Questions about using lm_eval to test Jamba on HellaSwag, ARC-Challenge, etc.
I greatly appreciate your work—it’s truly inspiring.
However, when I evaluated the model with lm_eval on tasks such as “lambada_openai”, “hellaswag”, “piqa”, “arc_easy”, “arc_challenge”, and “winogrande”, the results were unexpectedly poor, sometimes even worse than those of the mamba_790M model. I suspect the model’s “think” behavior may be interfering with lm_eval during these evaluations.
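For reference, my run was roughly along these lines (a minimal sketch: the model path, dtype, batch size, and few-shot setting below are placeholders and assumptions on my part, not exact values):

```python
# Rough sketch of the lm_eval run; values marked as placeholders are not exact.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=<model-path-or-hub-id>,dtype=bfloat16",  # placeholder path
    tasks=[
        "lambada_openai",
        "hellaswag",
        "piqa",
        "arc_easy",
        "arc_challenge",
        "winogrande",
    ],
    num_fewshot=0,   # assumed zero-shot
    batch_size=16,   # placeholder
)

# Print per-task metrics (accuracy, perplexity, etc.)
for task, metrics in results["results"].items():
    print(task, metrics)
```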
Could you share the exact settings and commands you used for evaluating downstream tasks?
I noticed you evaluated the model on MMLU-Pro. Could you share the configuration or the script used for the MMLU-Pro evaluation?
@gaomuxuan Our team used Artificial Analysis' evaluation methodology: https://artificialanalysis.ai/methodology/intelligence-benchmarking