MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? Paper • 2408.13257 • Published about 1 month ago
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22
LongVILA: Scaling Long-Context Visual Language Models for Long Videos Paper • 2408.10188 • Published Aug 19