SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published 13 days ago • 162
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model Paper • 2503.05132 • Published Mar 7 • 55
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language Paper • 2503.23730 • Published 20 days ago • 4
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse Paper • 2503.16365 • Published about 1 month ago • 39
mistralai/Mistral-Small-3.1-24B-Instruct-2503 Image-Text-to-Text • Updated 12 days ago • 105k • 1.13k
microsoft/Phi-4-multimodal-instruct Automatic Speech Recognition • Updated 12 days ago • 620k • 1.31k