LLaVE is a series of large language and vision embedding models trained on a variety of multimodal embedding datasets.
- zhibinlan/LLaVE-0.5B — Image-Text-to-Text • 33.8k downloads • 7 likes
- zhibinlan/LLaVE-2B — Image-Text-to-Text • 22.4k downloads • 45 likes
- zhibinlan/LLaVE-7B — Image-Text-to-Text • 709 downloads • 5 likes
- Paper: LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning — arXiv:2503.04812 • 15 likes
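To give a rough sense of the hardness-weighted contrastive idea named in the paper title, here is a minimal NumPy sketch: an InfoNCE-style loss where each in-batch negative is re-weighted by its similarity to the query, so harder negatives contribute more. The weighting scheme, function name, and hyperparameters (`tau`, `beta`) are illustrative assumptions, not LLaVE's exact formulation.

```python
import numpy as np

def hardness_weighted_infonce(q, c, tau=0.05, beta=1.0):
    """Illustrative hardness-weighted contrastive loss (a sketch,
    not LLaVE's exact formula).

    q, c: (N, D) L2-normalized query / candidate embeddings;
    row i of c is the positive for row i of q.
    """
    sim = q @ c.T                     # (N, N) cosine similarities
    n = sim.shape[0]
    neg = ~np.eye(n, dtype=bool)      # mask selecting the negatives
    # Hardness weights: negatives with higher similarity (harder
    # negatives) get exponentially larger weight.
    w = np.exp(beta * sim) * neg
    w = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-12) * (n - 1)
    logits = sim / tau                # temperature-scaled similarities
    pos = np.exp(np.diag(logits))
    negs = (w * np.exp(logits) * neg).sum(axis=1)
    # Per-query InfoNCE with re-weighted negatives, averaged over batch.
    return float(np.mean(-np.log(pos / (pos + negs))))
```

When the candidate matching each query is clearly the most similar, the loss is near zero; mismatched or random candidates drive it up, and the `beta` term concentrates gradient on the hardest in-batch negatives.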