LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training • arXiv:2509.23661 • Published Sep 28, 2025
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features • arXiv:2502.14786 • Published Feb 20, 2025
Building and Better Understanding Vision-Language Models: Insights and Future Directions • arXiv:2408.12637 • Published Aug 22, 2024
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration • arXiv:2311.04257 • Published Nov 7, 2023