Submitted by jiuhai · 57 · BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset · 13 authors · 3
Submitted by xiaomoguhzz · 38 · DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception · 6 authors · 3
Submitted by nielsr · 34 · Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures · 15 authors · 4
Submitted by scikkk · 33 · MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning · 11 authors · 1
Submitted by toshas · 16 · Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis · 8 authors · 2
Submitted by HanjungKim · 13 · UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations · 6 authors · 2
Submitted by NadMag · 9 · LightLab: Controlling Light Sources in Images with Diffusion Models · 7 authors · 3
Submitted by akhaliq · 7 · CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image · 9 authors · 3
Submitted by novateur · 6 · WavReward: Spoken Dialogue Models With Generalist Reward Evaluators · 14 authors · 3
Submitted by pritamqu · 4 · VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models · 2 authors · 2
Submitted by peihaowang · 2 · Steepest Descent Density Control for Compact 3D Gaussian Splatting · 11 authors · 2
Submitted by kailassrt · 2 · DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition · 11 authors · 2
Submitted by JadeCheng · 1 · Visually Interpretable Subtask Reasoning for Visual Question Answering · 3 authors · 2
Submitted by kkr5155 · 1 · Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA · 4 authors · 2