DocLLM: A layout-aware generative language model for multimodal document understanding Paper • 2401.00908 • Published Dec 31, 2023 • 181
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training Paper • 2401.00849 • Published Jan 1 • 14
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents Paper • 2311.05437 • Published Nov 9, 2023 • 47
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing Paper • 2311.00571 • Published Nov 1, 2023 • 40
LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model Paper • 2401.02330 • Published Jan 4 • 14
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action Paper • 2312.17172 • Published Dec 28, 2023 • 26
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks Paper • 2206.08916 • Published Jun 17, 2022
Improving fine-grained understanding in image-text pre-training Paper • 2401.09865 • Published Jan 18 • 16
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models Paper • 2402.05935 • Published Feb 8 • 15
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling Paper • 2402.06118 • Published Feb 9 • 13
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models Paper • 2402.07865 • Published Feb 12 • 12
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models Paper • 2402.13577 • Published Feb 21 • 7
A Touch, Vision, and Language Dataset for Multimodal Alignment Paper • 2402.13232 • Published Feb 20 • 13
TinyLLaVA: A Framework of Small-scale Large Multimodal Models Paper • 2402.14289 • Published Feb 22 • 19
Enhancing Vision-Language Pre-training with Rich Supervisions Paper • 2403.03346 • Published Mar 5 • 14
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images Paper • 2403.11703 • Published Mar 18 • 16
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models Paper • 2403.13447 • Published Mar 20 • 18
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference Paper • 2403.14520 • Published Mar 21 • 33
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? Paper • 2403.14624 • Published Mar 21 • 51
MoAI: Mixture of All Intelligence for Large Language and Vision Models Paper • 2403.07508 • Published Mar 12 • 75
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models Paper • 2403.18814 • Published Mar 27 • 44
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD Paper • 2404.06512 • Published Apr 9 • 29
BLINK: Multimodal Large Language Models Can See but Not Perceive Paper • 2404.12390 • Published Apr 18 • 24
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models Paper • 2404.12387 • Published Apr 18 • 38
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation Paper • 2404.14396 • Published Apr 22 • 18
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites Paper • 2404.16821 • Published Apr 25 • 53
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension Paper • 2404.16790 • Published Apr 25 • 7
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning Paper • 2404.16994 • Published Apr 25 • 35
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published May 24 • 43
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation Paper • 2406.09961 • Published Jun 14 • 54
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text Paper • 2406.08418 • Published Jun 12 • 28
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models Paper • 2406.09403 • Published Jun 13 • 19
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus Paper • 2406.08707 • Published Jun 13 • 15
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark Paper • 2406.05967 • Published Jun 10 • 5
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs Paper • 2406.11833 • Published Jun 17 • 61
mDPO: Conditional Preference Optimization for Multimodal Large Language Models Paper • 2406.11839 • Published Jun 17 • 37
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens Paper • 2406.11271 • Published Jun 17 • 18
Understanding Alignment in Multimodal LLMs: A Comprehensive Study Paper • 2407.02477 • Published Jul 2 • 21
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24 • 57
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation Paper • 2407.06135 • Published Jul 8 • 20
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model Paper • 2407.07053 • Published Jul 9 • 41
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models Paper • 2407.07895 • Published Jul 10 • 40
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers Paper • 2407.09413 • Published Jul 12 • 9
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model Paper • 2407.16198 • Published Jul 23 • 13
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models Paper • 2408.04840 • Published Aug 9 • 31
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16 • 97
LongVILA: Scaling Long-Context Visual Language Models for Long Videos Paper • 2408.10188 • Published Aug 19 • 51
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation Paper • 2408.12528 • Published Aug 22 • 50
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications Paper • 2408.11878 • Published Aug 20 • 50
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22 • 117
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders Paper • 2408.15998 • Published Aug 28 • 83
CogVLM2: Visual Language Models for Image and Video Understanding Paper • 2408.16500 • Published Aug 29 • 56
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation Paper • 2408.15881 • Published Aug 28 • 20
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture Paper • 2409.02889 • Published Sep 4 • 54
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding Paper • 2409.03420 • Published Sep 5 • 23
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Paper • 2409.12191 • Published Sep 18 • 73
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution Paper • 2409.12961 • Published Sep 19 • 23
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models Paper • 2409.17146 • Published Sep 25 • 99
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Paper • 2409.18125 • Published Sep 26 • 33
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning Paper • 2409.20566 • Published Sep 30 • 51
LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks Paper • 2410.01744 • Published Oct 2 • 25
TLDR: Token-Level Detective Reward Model for Large Vision Language Models Paper • 2410.04734 • Published Oct 7 • 16
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation Paper • 2410.11779 • Published Oct 2024 • 24
From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning Paper • 2410.06456 • Published Oct 9 • 35
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation Paper • 2410.13848 • Published Oct 2024 • 27
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction Paper • 2410.17247 • Published Oct 2024 • 43
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation Paper • 2410.13861 • Published Oct 2024 • 53
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages Paper • 2410.16153 • Published Oct 2024 • 42
DM-Codec: Distilling Multimodal Representations for Speech Tokenization Paper • 2410.15017 • Published Oct 2024 • 1
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities Paper • 2410.11190 • Published Oct 2024 • 20
Distill Visual Chart Reasoning Ability from LLMs to MLLMs Paper • 2410.18798 • Published Oct 2024 • 19
WAFFLE: Multi-Modal Model for Automated Front-End Development Paper • 2410.18362 • Published Oct 2024 • 11
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning Paper • 2410.17779 • Published Oct 2024 • 7
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding Paper • 2410.17434 • Published Oct 2024 • 24
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction Paper • 2410.21169 • Published Oct 2024 • 29
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks Paper • 2410.19100 • Published Oct 2024 • 6
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data Paper • 2410.18558 • Published Oct 2024 • 17
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents Paper • 2410.23218 • Published Oct 2024 • 44
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos Paper • 2411.04923 • Published Nov 2024 • 20