Questions about using lm_eval to test Jamba on HellaSwag, ARC-Challenge, etc.
I greatly appreciate your work—it’s truly inspiring.
However, when I evaluated the model with lm_eval on tasks such as “lambada_openai”, “hellaswag”, “piqa”, “arc_easy”, “arc_challenge”, and “winogrande”, the results were unexpectedly poor, sometimes even worse than those of the mamba_790M model. I suspect the model’s “think” behavior may be interfering with lm_eval during these evaluations.
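For reference, my run was roughly along these lines (a minimal sketch: the model path, dtype, batch size, and few-shot setting below are placeholders and assumptions on my part, not exact values):

```python
# Rough sketch of the lm_eval run; values marked as placeholders are not exact.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=<model-path-or-hub-id>,dtype=bfloat16",  # placeholder path
    tasks=[
        "lambada_openai",
        "hellaswag",
        "piqa",
        "arc_easy",
        "arc_challenge",
        "winogrande",
    ],
    num_fewshot=0,   # assumed zero-shot
    batch_size=16,   # placeholder
)

# Print per-task metrics (accuracy, perplexity, etc.)
for task, metrics in results["results"].items():
    print(task, metrics)
```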
Could you share the exact settings and commands you used for evaluating downstream tasks?
I noticed you evaluated the model on MMLU-Pro. Could you share the configuration or the script used for the MMLU-Pro evaluation?
@gaomuxuan Our team used Artificial Analysis' evaluation methodology: https://artificialanalysis.ai/methodology/intelligence-benchmarking