Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model Paper • 2407.07053 • Published Jul 9, 2024 • 44
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models Paper • 2407.12772 • Published Jul 17, 2024 • 34
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models Paper • 2407.11691 • Published Jul 16, 2024 • 14
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models Paper • 2408.02718 • Published Aug 5, 2024 • 61
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models Paper • 2408.11817 • Published Aug 21, 2024 • 9
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? Paper • 2408.13257 • Published Aug 23, 2024 • 26
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios Paper • 2408.17267 • Published Aug 30, 2024 • 23
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images Paper • 2408.16176 • Published Aug 28, 2024 • 8
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark Paper • 2409.02813 • Published Sep 4, 2024 • 29
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? Paper • 2409.07703 • Published Sep 12, 2024 • 67
OmniBench: Towards The Future of Universal Omni-Language Models Paper • 2409.15272 • Published Sep 23, 2024 • 28
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models Paper • 2409.13592 • Published Sep 20, 2024 • 50
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos Paper • 2410.02763 • Published Oct 3, 2024 • 7
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks Paper • 2410.12381 • Published Oct 16, 2024 • 44
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation Paper • 2410.12722 • Published Oct 16, 2024 • 5
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models Paper • 2410.10139 • Published Oct 14, 2024 • 52
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks Paper • 2410.10563 • Published Oct 14, 2024 • 39
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content Paper • 2410.10783 • Published Oct 14, 2024 • 27
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models Paper • 2410.10818 • Published Oct 14, 2024 • 17
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models Paper • 2410.09733 • Published Oct 13, 2024 • 9
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures Paper • 2410.13754 • Published Oct 17, 2024 • 75
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio Paper • 2410.12787 • Published Oct 16, 2024 • 31
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation Paper • 2410.17250 • Published Oct 22, 2024 • 15
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples Paper • 2410.14669 • Published Oct 18, 2024 • 37
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts Paper • 2410.18071 • Published Oct 23, 2024 • 7
CLEAR: Character Unlearning in Textual and Visual Modalities Paper • 2410.18057 • Published Oct 23, 2024 • 201
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark Paper • 2410.19168 • Published Oct 24, 2024 • 19
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays Paper • 2410.21969 • Published Oct 29, 2024 • 10
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models Paper • 2410.23266 • Published Oct 30, 2024 • 20
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models Paper • 2411.00836 • Published Oct 29, 2024 • 15
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models Paper • 2411.04075 • Published Nov 6, 2024 • 17
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework Paper • 2411.06176 • Published Nov 9, 2024 • 45
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation Paper • 2411.07975 • Published Nov 12, 2024 • 30
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models Paper • 2411.17451 • Published Nov 26, 2024 • 11
Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment Paper • 2411.17188 • Published Nov 26, 2024 • 22
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format Paper • 2411.17991 • Published Nov 27, 2024 • 5
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs Paper • 2411.15296 • Published Nov 22, 2024 • 20
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information Paper • 2412.00947 • Published Dec 1, 2024 • 8
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? Paper • 2412.02611 • Published Dec 3, 2024 • 24
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark Paper • 2412.07825 • Published Dec 10, 2024 • 11
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations Paper • 2412.07626 • Published Dec 10, 2024 • 22
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions Paper • 2412.08737 • Published Dec 11, 2024 • 53
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities Paper • 2412.07769 • Published Dec 10, 2024 • 26
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models Paper • 2412.12606 • Published Dec 17, 2024 • 41
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain Paper • 2412.13018 • Published Dec 17, 2024 • 41
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces Paper • 2412.14171 • Published Dec 18, 2024 • 24
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks Paper • 2412.18072 • Published Dec 24, 2024 • 18
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models Paper • 2501.02955 • Published Jan 6, 2025 • 40
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? Paper • 2501.05510 • Published Jan 9, 2025 • 39
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs Paper • 2501.06186 • Published Jan 10, 2025 • 61
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents Paper • 2501.08828 • Published Jan 15, 2025 • 30
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos Paper • 2501.09781 • Published Jan 16, 2025 • 25
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21, 2025 • 82
MSTS: A Multimodal Safety Test Suite for Vision-Language Models Paper • 2501.10057 • Published Jan 17, 2025 • 8
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Paper • 2501.13826 • Published Jan 23, 2025 • 24
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding Paper • 2501.16411 • Published Jan 27, 2025 • 18
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models Paper • 2502.00698 • Published Feb 2, 2025 • 23
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation Paper • 2502.08168 • Published Feb 12, 2025 • 10
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents Paper • 2502.09560 • Published Feb 13, 2025 • 32
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency Paper • 2502.09621 • Published Feb 13, 2025 • 26
mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data Paper • 2502.08468 • Published Feb 12, 2025 • 12
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models Paper • 2502.09696 • Published Feb 13, 2025 • 35
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment Paper • 2502.10391 • Published Feb 14, 2025 • 29
MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching Paper • 2502.12852 • Published Feb 18, 2025 • 3
GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking Paper • 2502.13766 • Published Feb 19, 2025 • 3