CoMP: Continual Multimodal Pre-training for Vision Foundation Models • Paper • 2503.18931 • Published 5 days ago • 29
Long-Context Autoregressive Video Modeling with Next-Frame Prediction • Paper • 2503.19325 • Published 4 days ago • 68
Training-free Diffusion Acceleration with Bottleneck Sampling • Paper • 2503.18940 • Published 5 days ago • 12
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity • Paper • 2503.07677 • Published 19 days ago • 81
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video • Paper • 2503.11647 • Published 15 days ago • 123
Reangle-A-Video: 4D Video Generation as Video-to-Video Translation • Paper • 2503.09151 • Published 17 days ago • 29
YuE: Scaling Open Foundation Models for Long-Form Music Generation • Paper • 2503.08638 • Published 18 days ago • 60
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning • Paper • 2503.04812 • Published 25 days ago • 13
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer • Paper • 2503.07027 • Published 19 days ago • 26
Token-Efficient Long Video Understanding for Multimodal LLMs • Paper • 2503.04130 • Published 23 days ago • 86
UniTok: A Unified Tokenizer for Visual Generation and Understanding • Paper • 2502.20321 • Published 30 days ago • 29