DMind Benchmark: The First Comprehensive Benchmark for LLM Evaluation in the Web3 Domain Paper • 2504.16116 • Published Apr 18 • 11
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models Paper • 2505.02735 • Published May 5 • 31
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models Paper • 2505.02735 • Published May 5 • 31
Objaverse++: Curated 3D Object Dataset with Quality Annotations Paper • 2504.07334 • Published Apr 9 • 1
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values Paper • 2504.05535 • Published Apr 7 • 44
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values Paper • 2504.05535 • Published Apr 7 • 44
YuE: Scaling Open Foundation Models for Long-Form Music Generation Paper • 2503.08638 • Published Mar 11 • 66
YuE: Scaling Open Foundation Models for Long-Form Music Generation Paper • 2503.08638 • Published Mar 11 • 66
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines Paper • 2502.14739 • Published Feb 20 • 103
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines Paper • 2502.14739 • Published Feb 20 • 103
MetaOcc: Surround-View 4D Radar and Camera Fusion Framework for 3D Occupancy Prediction with Dual Training Strategies Paper • 2501.15384 • Published Jan 26
OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous Driving Paper • 2412.10734 • Published Dec 14, 2024 • 1
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations Paper • 2412.07626 • Published Dec 10, 2024 • 23
KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks Paper • 2410.06526 • Published Oct 9, 2024 • 1