Questions about using lm_eval to test Jamba on HellaSwag, ARC-Challenge, etc.

#7 opened by gaomuxuan

I greatly appreciate your work—it’s truly inspiring.

However, when I tried to evaluate the model with lm_eval on tasks such as “lambada_openai”, “hellaswag”, “piqa”, “arc_easy”, “arc_challenge”, and “winogrande”, the results were unexpectedly poor, in some cases even worse than those of the mamba_790M model. I suspect this may be because the model’s “think” behavior interferes with lm_eval during testing.
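For concreteness, here is a minimal sketch of the kind of lm_eval run I mean, using the standard Hugging Face backend. The model id, dtype, and batch size below are placeholders for my setup, and the tasks run with the harness defaults:

```python
import lm_eval

# Placeholder checkpoint id; substitute the Jamba model actually being evaluated.
MODEL = "ai21labs/AI21-Jamba-Reasoning-3B"

results = lm_eval.simple_evaluate(
    model="hf",  # standard Hugging Face causal-LM backend
    model_args=f"pretrained={MODEL},dtype=bfloat16,trust_remote_code=True",
    tasks=[
        "lambada_openai",
        "hellaswag",
        "piqa",
        "arc_easy",
        "arc_challenge",
        "winogrande",
    ],
    batch_size=8,  # arbitrary; adjust to available memory
)
# Per-task metrics (acc, acc_norm, perplexity, ...) are under the "results" key.
print(results["results"])
```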

Could you share the exact settings and commands you used for evaluating downstream tasks?

I noticed you evaluated the model on MMLU-Pro. Could you share the configuration or the script used for the MMLU-Pro evaluation?

AI21 org

@gaomuxuan our team used Artificial Analysis' evaluation methodology: https://artificialanalysis.ai/methodology/intelligence-benchmarking
