PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models Paper • 2503.12545 • Published 4 days ago • 5
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification Paper • 2503.12505 • Published 4 days ago • 9
ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges Paper • 2503.06553 • Published 12 days ago • 8
GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation Paper • 2411.18499 • Published Nov 27, 2024 • 18