VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation Paper • 2506.03930 • Published 7 days ago • 22
PhyX: Does Your Model Have the "Wits" for Physical Reasoning? Paper • 2505.15929 • Published 20 days ago • 48
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? Paper • 2404.05955 • Published Apr 9, 2024
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism Paper • 2407.10457 • Published Jul 15, 2024 • 25
AgentBank: Towards Generalized LLM Agents via Fine-Tuning on 50000+ Interaction Trajectories Paper • 2410.07706 • Published Oct 10, 2024
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining Paper • 2505.07608 • Published 29 days ago • 79
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines Paper • 2502.14739 • Published Feb 20, 2025 • 103
II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models Paper • 2406.05862 • Published Jun 9, 2024 • 4
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks Paper • 2410.10563 • Published Oct 14, 2024 • 39
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark Paper • 2409.02813 • Published Sep 4, 2024 • 32
MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation Paper • 2406.15252 • Published Jun 21, 2024 • 18
GenAI Arena: An Open Evaluation Platform for Generative Models Paper • 2406.04485 • Published Jun 6, 2024 • 23
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark Paper • 2406.01574 • Published Jun 3, 2024 • 47
Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents Paper • 2403.02502 • Published Mar 4, 2024 • 3
A Comprehensive Study of Knowledge Editing for Large Language Models Paper • 2401.01286 • Published Jan 2, 2024 • 21
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI Paper • 2311.16502 • Published Nov 27, 2023 • 35