- VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning • arXiv:2504.08837 • published 11 days ago • 41 upvotes
- GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation • arXiv:2504.08736 • published 10 days ago • 46 upvotes
- Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources • arXiv:2504.00595 • published 20 days ago • 34 upvotes
- ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model • arXiv:2503.21144 • published 25 days ago • 25 upvotes
- InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity • arXiv:2503.16418 • published Mar 20 • 35 upvotes
- SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion • arXiv:2503.11576 • published Mar 14 • 96 upvotes
- EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer • arXiv:2503.07027 • published Mar 10 • 28 upvotes
- VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control • arXiv:2503.05639 • published Mar 7 • 22 upvotes
- Token-Efficient Long Video Understanding for Multimodal LLMs • arXiv:2503.04130 • published Mar 6 • 93 upvotes
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs • arXiv:2503.01743 • published Mar 3 • 84 upvotes
- VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing • arXiv:2502.17258 • published Feb 24 • 78 upvotes
- Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening • arXiv:2502.12146 • published Feb 17 • 16 upvotes
- Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation • arXiv:2502.08690 • published Feb 12 • 43 upvotes
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment • arXiv:2502.10391 • published Feb 14 • 34 upvotes
- CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation • arXiv:2502.08639 • published Feb 12 • 43 upvotes