cua-bench: A Framework for Benchmarking, Training Data, and RL Environments for Computer-Use Agents Article • 19 days ago • 10
OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization Paper • 2506.18880 • Published Jun 23, 2025 • 4
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? Paper • 2503.19990 • Published Mar 25, 2025 • 35