StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs Paper • 2505.20139 • Published 11 days ago • 18
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research Paper • 2505.19955 • Published 12 days ago • 10
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design Paper • 2505.16175 • Published 16 days ago • 39
General-Reasoner: Advancing LLM Reasoning Across All Domains Paper • 2505.14652 • Published 17 days ago • 22
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation Paper • 2504.00043 • Published Mar 30 • 9
Small Models Struggle to Learn from Strong Reasoners Paper • 2502.12143 • Published Feb 17 • 38
ACECODER: Acing Coder RL via Automated Test-Case Synthesis Paper • 2502.01718 • Published Feb 3 • 29
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning Paper • 2502.01100 • Published Feb 3 • 17
Running • 548 • Vision Arena (Testing VLMs side-by-side) 🖼 Analyze images to detect and label objects
On Memorization of Large Language Models in Logical Reasoning Paper • 2410.23123 • Published Oct 30, 2024 • 18
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks Paper • 2410.10563 • Published Oct 14, 2024 • 39
Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies Paper • 2308.03188 • Published Aug 6, 2023 • 2
Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction Paper • 2305.13903 • Published May 23, 2023
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings Paper • 2305.02317 • Published May 3, 2023