OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement Paper • 2503.17352 • Published Mar 21, 2025 • 23 upvotes
When Less is Enough: Adaptive Token Reduction for Efficient Image Representation Paper • 2503.16660 • Published Mar 20, 2025 • 73 upvotes
CoMP: Continual Multimodal Pre-training for Vision Foundation Models Paper • 2503.18931 • Published Mar 2025 • 30 upvotes
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding Paper • 2503.13964 • Published Mar 18, 2025 • 19 upvotes
ViLBench: A Suite for Vision-Language Process Reward Modeling Paper • 2503.20271 • Published Mar 2025 • 7 upvotes
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme Paper • 2504.02587 • Published Apr 2025 • 30 upvotes
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement Paper • 2504.07934 • Published Apr 2025 • 17 upvotes