Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models Paper • 2505.04921 • Published May 8 • 185
On Path to Multimodal Generalist: General-Level and General-Bench Paper • 2505.04620 • Published May 7 • 83
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant Paper • 2505.05467 • Published May 8 • 14
Adapting Vision-Language Models Without Labels: A Comprehensive Survey Paper • 2508.05547 • Published 17 days ago • 11
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models Paper • 2508.02095 • Published 20 days ago • 6
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL Paper • 2508.13167 • Published 18 days ago • 103
Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations Paper • 2508.09789 • Published 11 days ago • 5
MedSAMix: A Training-Free Model Merging Approach for Medical Image Segmentation Paper • 2508.11032 • Published 9 days ago • 2
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory Paper • 2508.09736 • Published 11 days ago • 50