Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models Paper • 2506.05176 • Published 6 days ago • 54
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics Paper • 2506.04308 • Published 6 days ago • 39
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces Paper • 2506.00123 • Published 11 days ago • 32
ZeroGUI: Automating Online GUI Learning at Zero Human Cost Paper • 2505.23762 • Published 12 days ago • 45
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Paper • 2505.13227 • Published 23 days ago • 45
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder Paper • 2505.07916 • Published 30 days ago • 124
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning Paper • 2505.10557 • Published 26 days ago • 46
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Paper • 2505.09568 • Published 27 days ago • 93
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning Paper • 2505.08617 • Published 29 days ago • 41
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning Paper • 2505.07263 • Published 30 days ago • 29
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining Paper • 2505.07608 • Published 30 days ago • 79
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models Paper • 2505.04921 • Published May 8 • 176