umd-vt-nyu/JH_dc-vae-f32c32-sana-1.0-768_patch-1_epoch-64_group-7_fusion_residual_attn Updated 19 days ago
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis Paper • 2505.10046 • Published 28 days ago • 9
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Paper • 2505.09568 • Published 28 days ago • 93
PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop Paper • 2503.09595 • Published Mar 12
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness Paper • 2504.10514 • Published Apr 10 • 47
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Paper • 2412.04424 • Published Dec 5, 2024 • 64
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24, 2024 • 61
Automated Data Curation for Robust Language Model Fine-Tuning Paper • 2403.12776 • Published Mar 19, 2024
Image Sculpting: Precise Object Editing with 3D Geometry Control Paper • 2401.01702 • Published Jan 2, 2024 • 21
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition Paper • 2203.07996 • Published Feb 24, 2022
Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models Paper • 2211.10950 • Published Nov 20, 2022
Kosmos-G: Generating Images in Context with Multimodal Large Language Models Paper • 2310.02992 • Published Oct 4, 2023 • 4