Vision Language Models: 2025 Update

sergiopaniego's Collections

updated May 12

This collection includes all the models, datasets, and Spaces mentioned in the blog post "Vision Language Models: 2025 Update".

Qwen/Qwen2.5-Omni-7B

Any-to-Any • 11B • Updated Apr 30 • 137k • 1.76k
Running

340

Qwen2.5 Omni 7B Demo

🏆

Generate text and speech responses from various inputs
Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26 • 165
openbmb/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Jun 20 • 196k • 1.22k
deepseek-ai/Janus-Pro-7B

Any-to-Any • Updated Feb 1 • 112k • 3.48k
Running on Zero

1.99k

Chat With Janus-Pro-7B

🌍

A unified multimodal understanding and generation model.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Paper • 2501.17811 • Published Jan 29 • 6
Qwen/QVQ-72B-Preview

Image-Text-to-Text • 73B • Updated Jan 12 • 47.5k • 605
moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • 16B • Updated 3 days ago • 4.63k • 436
Running on Zero

165

Chat with Kimi-VL-A3B-Thinking-2506

🤔

Chat with images, videos, or PDFs to generate text
Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10 • 134
moonshotai/MoonViT-SO-400M

Image Feature Extraction • 0.4B • Updated Apr 17 • 413 • 22
google/siglip-so400m-patch14-384

Zero-Shot Image Classification • 0.9B • Updated Sep 26, 2024 • 3.7M • 585
moonshotai/Kimi-VL-A3B-Instruct

Image-Text-to-Text • 16B • Updated 22 days ago • 213k • 233
HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • 2B • Updated Apr 8 • 76.8k • 534
Running on Zero

142

SmolVLM

📊

Generate answers by combining text and images
HuggingFaceTB/SmolVLM2-2.2B-Instruct

Image-Text-to-Text • 2B • Updated Apr 8 • 130k • 241
SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 197
Build error

80

SmolVLM

📊

Generate answers by combining text and images
google/gemma-3-27b-it

Image-Text-to-Text • 27B • Updated Mar 21 • 510k • 1.57k
unsloth/gemma-3-27b-it-GGUF

Image-Text-to-Text • 27B • Updated 7 days ago • 51.5k • 151
google/gemma-3-27b-it-qat-q4_0-gguf

Image-Text-to-Text • 27B • Updated Apr 11 • 7.37k • 325
meta-llama/Llama-4-Scout-17B-16E-Instruct

Image-Text-to-Text • 109B • Updated May 22 • 792k • 1.06k
meta-llama/Llama-4-Maverick-17B-128E-Instruct

Image-Text-to-Text • 402B • Updated May 22 • 51.3k • 395
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Paper • 2401.15947 • Published Jan 29, 2024 • 54
deepseek-ai/deepseek-vl2

Image-Text-to-Text • 27B • Updated Dec 18, 2024 • 3.69k • 356
Running on Zero

531

Chat with DeepSeek-VL2-small

🌍

Generate responses using images and text input
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Paper • 2412.10302 • Published Dec 13, 2024 • 18
lerobot/pi0

Robotics • 4B • Updated Mar 6 • 13.7k • 287
lerobot/pi0fast_base

Robotics • 3B • Updated Mar 31 • 1.02k • 24
nvidia/GR00T-N1-2B

Robotics • 2B • Updated Jul 8 • 1.22k • 327
google/paligemma-3b-pt-224

Image-Text-to-Text • 3B • Updated Sep 21, 2024 • 35.4k • 343
PaliGemma: A versatile 3B VLM for transfer

Paper • 2407.07726 • Published Jul 10, 2024 • 72
Paused

313

PaliGemma Demo

🤲

Annotate and describe images with text prompts
PaliGemma 2: A Family of Versatile VLMs for Transfer

Paper • 2412.03555 • Published Dec 4, 2024 • 134
Running on Zero

91

Paligemma2 Mix

🌖

Generate text or segment objects from an image
google/paligemma2-10b-mix-448

Image-Text-to-Text • 10B • Updated Feb 7 • 4.28k • 31
allenai/Molmo-72B-0924

Image-Text-to-Text • 73B • Updated Jun 19 • 2.89k • 289
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 122
Qwen/Qwen2.5-VL-72B-Instruct

Image-Text-to-Text • 73B • Updated Jun 6 • 864k • 529
Qwen2.5-VL Technical Report

Paper • 2502.13923 • Published Feb 19 • 200
google/shieldgemma-2-4b-it

Image-Text-to-Text • 4B • Updated Apr 4 • 2.74k • 122
ShieldGemma 2: Robust and Tractable Image Content Moderation

Paper • 2504.01081 • Published Apr 1 • 3
Running on Zero

12

ShieldGemma2 VLM

📉

Demo for ShieldGemma 2, multimodal safety model
meta-llama/Llama-Guard-4-12B

Image-Text-to-Text • 12B • Updated Apr 29 • 30.4k • 52
Runtime error

Llama Guard 4

🦀

Check if text and images are safe
Running

259

Qwen2.5 VL 72B Instruct

💻

Interact with a multimodal chatbot using text and images
marco/mcdse-2b-v1

2B • Updated Oct 29, 2024 • 4.71k • 56
vidore/colpali-v1.3

Visual Document Retrieval • Updated Mar 14 • 116k • 69
ColPali: Efficient Document Retrieval with Vision Language Models

Paper • 2407.01449 • Published Jun 27, 2024 • 51
vidore/colqwen2.5-v0.2

Visual Document Retrieval • Updated Jun 16 • 36.6k • 67
vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14 • 1.48k • 53
Qwen/Qwen2.5-VL-32B-Instruct

Image-Text-to-Text • 33B • Updated Apr 14 • 466k • 425
Running

147

Qwen2.5 VL 32B Instruct Demo

🏃

Chat with images and videos using Qwen
Vision-CAIR/LongVU_Qwen2_7B

Video-Text-to-Text • 8B • Updated Feb 28 • 267 • 73
Running on Zero

86

LongVU

🌖

Generate responses to video or image inputs
openbmb/RLAIF-V-Dataset

Viewer • Updated Mar 4 • 74.8k • 1.35k • 182
HuggingFaceH4/rlaif-v_formatted

Viewer • Updated Jul 2, 2024 • 83.1k • 603 • 12
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Paper • 2404.16006 • Published Apr 24, 2024
Kaining/MMT-Bench

Viewer • Updated Jun 21, 2024 • 30k • 109 • 10
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Paper • 2409.02813 • Published Sep 4, 2024 • 32
MMMU/MMMU_Pro

Viewer • Updated Mar 8 • 5.19k • 5.86k • 32
reducto/RolmOCR

Image-to-Text • 8B • Updated Apr 2 • 121k • 474
Alpha-VLLM/Lumina-mGPT-7B-768

Any-to-Any • 7B • Updated Apr 7 • 626 • 36
facebook/chameleon-7b

Image-Text-to-Text • 7B • Updated Jul 23, 2024 • 57.1k • 188
