meta-llama/Llama-4-Scout-17B-16E-Instruct Image-Text-to-Text • Updated about 13 hours ago • 16k • 453
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization Paper • 2504.00999 • Published 5 days ago • 73
MixerMDM: Learnable Composition of Human Motion Diffusion Models Paper • 2504.01019 • Published 5 days ago • 17
GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors Paper • 2504.01016 • Published 5 days ago • 26
Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation Paper • 2503.24379 • Published 6 days ago • 68
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization Paper • 2503.19901 • Published 12 days ago • 32
MoCha: Towards Movie-Grade Talking Character Synthesis Paper • 2503.23307 • Published 8 days ago • 93
PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos Paper • 2503.17973 • Published 15 days ago • 7
Long-Context Autoregressive Video Modeling with Next-Frame Prediction Paper • 2503.19325 • Published 13 days ago • 71
Concat-ID: Towards Universal Identity-Preserving Video Synthesis Paper • 2503.14151 • Published 20 days ago • 10
AudioX: Diffusion Transformer for Anything-to-Audio Generation Paper • 2503.10522 • Published 24 days ago • 21
DAPO: An Open-Source LLM Reinforcement Learning System at Scale Paper • 2503.14476 • Published 19 days ago • 113
Edit Transfer: Learning Image Editing via Vision In-Context Relations Paper • 2503.13327 • Published 20 days ago • 28
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation Paper • 2503.06053 • Published 30 days ago • 136
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills Paper • 2503.12533 • Published 22 days ago • 63