BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation Paper • 2506.00482 • Published 10 days ago • 8
When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research Paper • 2505.11855 • Published 24 days ago • 9
When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research Paper • 2505.11855 • Published 24 days ago • 9
When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research Paper • 2505.11855 • Published 24 days ago • 9 • 2