Submitted by tellarin 26 Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills · 9 authors 1
Submitted by limuloo1999 21 DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models · 4 authors 2
Submitted by Orannue 16 Edit Transfer: Learning Image Editing via Vision In-Context Relations · 4 authors 2
Submitted by Lingaaaaaaa 12 WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes · 8 authors 1
Submitted by jmhb 10 MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research · 23 authors 1
Submitted by ZhaofengWu 10 reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs · 6 authors 1
Submitted by ZyZcuhk 9 BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing · 9 authors 1
Submitted by lixiaochuan 8 DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation · 13 authors 1
Submitted by akhaliq 4 R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization · 7 authors 1
Submitted by lwpyh 4 V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning · 6 authors 1
Submitted by Luo-Yihong 3 Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation · 5 authors 1
Submitted by k-nick 2 Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework · 8 authors 1
Submitted by JesseTNRoberts 1 Investigating Human-Aligned Large Language Model Uncertainty · 4 authors 1