Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities Paper • 2503.03983 • Published 4 days ago • 18
Token-Efficient Long Video Understanding for Multimodal LLMs Paper • 2503.04130 • Published 4 days ago • 64
Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions Paper • 2503.03278 • Published 5 days ago • 12
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding Paper • 2503.02951 • Published 5 days ago • 25
ABC: Achieving Better Control of Multimodal Embeddings using VLMs Paper • 2503.00329 • Published 9 days ago • 18
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs Paper • 2503.01743 • Published 6 days ago • 65
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning Paper • 2502.19634 • Published 11 days ago • 56
Introducing Visual Perception Token into Multimodal Large Language Model Paper • 2502.17425 • Published 13 days ago • 14
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference Paper • 2502.18411 • Published 12 days ago • 69
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models Paper • 2502.16033 • Published 16 days ago • 16
Evaluating Multimodal Generative AI with Korean Educational Standards Paper • 2502.15422 • Published 16 days ago • 9
VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues Paper • 2502.12084 • Published 20 days ago • 29
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published 17 days ago • 128
GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking Paper • 2502.13766 • Published 18 days ago • 3
InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning Paper • 2502.11573 • Published 21 days ago • 9
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation Paper • 2502.09838 • Published 24 days ago • 10
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm Paper • 2502.12513 • Published 20 days ago • 15