- OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction • 2503.03734 • Published 7 days ago • 1
- Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia • 2503.07920 • Published 2 days ago • 74
- MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice • 2503.05978 • Published 5 days ago • 29
- Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning • 2503.07572 • Published 2 days ago • 20
- OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models • 2503.08686 • Published 1 day ago • 13
- Gemini Embedding: Generalizable Embeddings from Gemini • 2503.07891 • Published 2 days ago • 19
- LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL • 2503.07536 • Published 2 days ago • 56
- SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories • 2503.08625 • Published 1 day ago • 22
- UniF²ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models • 2503.08120 • Published 1 day ago • 25
- YuE: Scaling Open Foundation Models for Long-Form Music Generation • 2503.08638 • Published 1 day ago • 48
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models • 2503.06749 • Published 3 days ago • 20
- Automated Movie Generation via Multi-Agent CoT Planning • 2503.07314 • Published 2 days ago • 32
- MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning • 2503.07365 • Published 2 days ago • 51
- SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models • 2503.07605 • Published 2 days ago • 61
- Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts • 2503.05447 • Published 5 days ago • 7
- R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model • 2503.05132 • Published 6 days ago • 42