Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Paper • 2507.07999 • Published 18 days ago • 45
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning Paper • 2506.24119 • Published 28 days ago • 46
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning Paper • 2507.01006 • Published 27 days ago • 200
VMoBA: Mixture-of-Block Attention for Video Diffusion Models Paper • 2506.23858 • Published 29 days ago • 30
CyberV: Cybernetics for Test-time Scaling in Video Understanding Paper • 2506.07971 • Published Jun 9 • 4
On Path to Multimodal Generalist: General-Level and General-Bench Paper • 2505.04620 • Published May 7 • 82
PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild Paper • 2504.11326 • Published Apr 15 • 6
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer Paper • 2504.10462 • Published Apr 14 • 15
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding Paper • 2504.10465 • Published Apr 14 • 27
OMG-Seg: Is One Model Good Enough For All Segmentation? Paper • 2401.10229 • Published Jan 18, 2024 • 1
Finally, a Replacement for BERT: Introducing ModernBERT Article • By bclavie and 14 others • Dec 19, 2024 • 670
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos Paper • 2501.04001 • Published Jan 7 • 47
Multimodal Latent Language Modeling with Next-Token Diffusion Paper • 2412.08635 • Published Dec 11, 2024 • 46
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation Paper • 2412.07589 • Published Dec 10, 2024 • 49
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing Paper • 2412.04280 • Published Dec 5, 2024 • 14
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation Paper • 2410.13848 • Published Oct 17, 2024 • 35