CoMP: Continual Multimodal Pre-training for Vision Foundation Models • arXiv:2503.18931 • Published 5 days ago • 29 upvotes
Long-Context Autoregressive Video Modeling with Next-Frame Prediction • arXiv:2503.19325 • Published 4 days ago • 68 upvotes
Training-free Diffusion Acceleration with Bottleneck Sampling • arXiv:2503.18940 • Published 5 days ago • 12 upvotes
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity • arXiv:2503.07677 • Published 19 days ago • 81 upvotes
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video • arXiv:2503.11647 • Published 15 days ago • 123 upvotes
Reangle-A-Video: 4D Video Generation as Video-to-Video Translation • arXiv:2503.09151 • Published 17 days ago • 29 upvotes
YuE: Scaling Open Foundation Models for Long-Form Music Generation • arXiv:2503.08638 • Published 18 days ago • 60 upvotes
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning • arXiv:2503.04812 • Published 25 days ago • 13 upvotes
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer • arXiv:2503.07027 • Published 19 days ago • 26 upvotes
Token-Efficient Long Video Understanding for Multimodal LLMs • arXiv:2503.04130 • Published 23 days ago • 86 upvotes
UniTok: A Unified Tokenizer for Visual Generation and Understanding • arXiv:2502.20321 • Published 30 days ago • 29 upvotes
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks • arXiv:2502.17157 • Published Feb 24 • 51 upvotes
MLGym: A New Framework and Benchmark for Advancing AI Research Agents • arXiv:2502.14499 • Published Feb 20 • 188 upvotes
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features • arXiv:2502.14786 • Published Feb 20 • 138 upvotes
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling • arXiv:2501.16975 • Published Jan 28 • 28 upvotes
Diffusion Adversarial Post-Training for One-Step Video Generation • arXiv:2501.08316 • Published Jan 14 • 33 upvotes