FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation Paper • 2410.22257 • Published Oct 29, 2024
CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives Paper • 2504.10823 • Published Apr 15 • 14
MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? Paper • 2504.09702 • Published Apr 13 • 18