Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers • Paper • 2506.07986 • Published 2 days ago • 15
Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts • Paper • 2506.05229 • Published 6 days ago • 37
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding • Paper • 2506.01853 • Published 9 days ago • 28
ComposeAnything: Composite Object Priors for Text-to-Image Generation • Paper • 2505.24086 • Published 13 days ago • 4
EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering • Paper • 2505.24417 • Published 13 days ago • 12
DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling • Paper • 2505.11196 • Published 27 days ago • 13
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective • Paper • 2505.15045 • Published 22 days ago • 54
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion • Paper • 2412.04424 • Published Dec 5, 2024 • 64
ZipIR: Latent Pyramid Diffusion Transformer for High-Resolution Image Restoration • Paper • 2504.08591 • Published Apr 11 • 19
On Path to Multimodal Generalist: General-Level and General-Bench • Paper • 2505.04620 • Published May 7 • 79
Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models • Paper • 2501.00917 • Published Jan 1 • 1