VisionLM - a zerozeyi Collection

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 29

Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 13

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Paper • 2402.05935 • Published Feb 8, 2024 • 17

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

Paper • 2402.06118 • Published Feb 9, 2024 • 15

OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

Paper • 2402.07456 • Published Feb 12, 2024 • 46

PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

Paper • 2402.07872 • Published Feb 12, 2024 • 16

Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Paper • 2402.07865 • Published Feb 12, 2024 • 15

World Model on Million-Length Video And Language With RingAttention

Paper • 2402.08268 • Published Feb 13, 2024 • 40

PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter

Paper • 2402.10896 • Published Feb 16, 2024 • 16

FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models

Paper • 2402.10986 • Published Feb 16, 2024 • 80

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Paper • 2402.12226 • Published Feb 19, 2024 • 45

CoLLaVO: Crayon Large Language and Vision mOdel

Paper • 2402.11248 • Published Feb 17, 2024 • 23

Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

Paper • 2402.11690 • Published Feb 18, 2024 • 10

VideoPrism: A Foundational Visual Encoder for Video Understanding

Paper • 2402.13217 • Published Feb 20, 2024 • 26

Video ReCap: Recursive Captioning of Hour-Long Videos

Paper • 2402.13250 • Published Feb 20, 2024 • 27

A Touch, Vision, and Language Dataset for Multimodal Alignment

Paper • 2402.13232 • Published Feb 20, 2024 • 15

How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts

Paper • 2402.13220 • Published Feb 20, 2024 • 15

BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models

Paper • 2402.13577 • Published Feb 21, 2024 • 10

PALO: A Polyglot Large Multimodal Model for 5B People

Paper • 2402.14818 • Published Feb 22, 2024 • 25

TinyLLaVA: A Framework of Small-scale Large Multimodal Models

Paper • 2402.14289 • Published Feb 22, 2024 • 21

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Paper • 2402.17177 • Published Feb 27, 2024 • 89

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Paper • 2402.19479 • Published Feb 29, 2024 • 35

MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Paper • 2403.01422 • Published Mar 3, 2024 • 30

InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding

Paper • 2403.01487 • Published Mar 3, 2024 • 16

Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

Paper • 2403.02677 • Published Mar 5, 2024 • 18

Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use

Paper • 2403.02626 • Published Mar 5, 2024 • 11

MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets

Paper • 2403.03194 • Published Mar 5, 2024 • 15

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Paper • 2403.03003 • Published Mar 5, 2024 • 11

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Paper • 2403.09611 • Published Mar 14, 2024 • 128

MoAI: Mixture of All Intelligence for Large Language and Vision Models

Paper • 2403.07508 • Published Mar 12, 2024 • 77

Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Paper • 2403.07750 • Published Mar 12, 2024 • 24

DragAnything: Motion Control for Anything using Entity Representation

Paper • 2403.07420 • Published Mar 12, 2024 • 15

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Paper • 2403.06764 • Published Mar 11, 2024 • 29

VideoMamba: State Space Model for Efficient Video Understanding

Paper • 2403.06977 • Published Mar 11, 2024 • 31

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Paper • 2403.05135 • Published Mar 8, 2024 • 46

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Paper • 2403.05530 • Published Mar 8, 2024 • 65

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Paper • 2403.05525 • Published Mar 8, 2024 • 47

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Paper • 2403.05438 • Published Mar 8, 2024 • 22

Uni-SMART: Universal Science Multimodal Analysis and Research Transformer

Paper • 2403.10301 • Published Mar 15, 2024 • 54

VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Paper • 2403.10517 • Published Mar 15, 2024 • 36

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

Paper • 2403.11703 • Published Mar 18, 2024 • 17

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Paper • 2403.11481 • Published Mar 18, 2024 • 13

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Paper • 2403.12895 • Published Mar 19, 2024 • 33

Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs

Paper • 2403.12596 • Published Mar 19, 2024 • 11

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Paper • 2403.14624 • Published Mar 21, 2024 • 54

Can large language models explore in-context?

Paper • 2403.15371 • Published Mar 22, 2024 • 34

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

Paper • 2403.15377 • Published Mar 22, 2024 • 26

SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series

Paper • 2403.15360 • Published Mar 22, 2024 • 13

VidLA: Video-Language Alignment at Scale

Paper • 2403.14870 • Published Mar 21, 2024 • 14

ViTAR: Vision Transformer with Any Resolution

Paper • 2403.18361 • Published Mar 27, 2024 • 56

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Paper • 2403.18814 • Published Mar 27, 2024 • 48

sDPO: Don't Use Your Data All at Once

Paper • 2403.19270 • Published Mar 28, 2024 • 42

TextCraftor: Your Text Encoder Can be Image Quality Controller

Paper • 2403.18978 • Published Mar 27, 2024 • 15

Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models

Paper • 2403.20331 • Published Mar 29, 2024 • 16

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Paper • 2404.01197 • Published Apr 1, 2024 • 32

Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

Paper • 2404.01258 • Published Apr 1, 2024 • 12

MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

Paper • 2404.03413 • Published Apr 4, 2024 • 29

LVLM-Intrepret: An Interpretability Tool for Large Vision-Language Models

Paper • 2404.03118 • Published Apr 3, 2024 • 27

CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Paper • 2404.03653 • Published Apr 4, 2024 • 37

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Paper • 2404.05719 • Published Apr 8, 2024 • 83

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Paper • 2404.05726 • Published Apr 8, 2024 • 23

MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

Paper • 2404.05674 • Published Apr 8, 2024 • 15

Koala: Key frame-conditioned long video-LLM

Paper • 2404.04346 • Published Apr 5, 2024 • 7

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Paper • 2404.06512 • Published Apr 9, 2024 • 31

Adapting LLaMA Decoder to Vision Transformer

Paper • 2404.06773 • Published Apr 10, 2024 • 18

BRAVE: Broadening the visual encoding of vision-language models

Paper • 2404.07204 • Published Apr 10, 2024 • 19

Transferable and Principled Efficiency for Open-Vocabulary Segmentation

Paper • 2404.07448 • Published Apr 11, 2024 • 12

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Paper • 2404.07973 • Published Apr 11, 2024 • 33

HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing

Paper • 2404.09990 • Published Apr 15, 2024 • 13

TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

Paper • 2404.09204 • Published Apr 14, 2024 • 11

On Speculative Decoding for Multimodal Large Language Models

Paper • 2404.08856 • Published Apr 13, 2024 • 13

Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

Paper • 2404.12387 • Published Apr 18, 2024 • 40

BLINK: Multimodal Large Language Models Can See but Not Perceive

Paper • 2404.12390 • Published Apr 18, 2024 • 27

MultiBooth: Towards Generating All Your Concepts in an Image from Text

Paper • 2404.14239 • Published Apr 22, 2024 • 9

A Multimodal Automated Interpretability Agent

Paper • 2404.14394 • Published Apr 22, 2024 • 22

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

Paper • 2404.12803 • Published Apr 19, 2024 • 31

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Paper • 2404.13013 • Published Apr 19, 2024 • 32

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Paper • 2404.15653 • Published Apr 24, 2024 • 30

Editable Image Elements for Controllable Synthesis

Paper • 2404.16029 • Published Apr 24, 2024 • 12

MoDE: CLIP Data Experts via Clustering

Paper • 2404.16030 • Published Apr 24, 2024 • 15

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

Paper • 2404.16790 • Published Apr 25, 2024 • 9

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Paper • 2404.16821 • Published Apr 25, 2024 • 58

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

Paper • 2404.16375 • Published Apr 25, 2024 • 18

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Paper • 2404.16994 • Published Apr 25, 2024 • 37

HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections

Paper • 2404.16845 • Published Feb 14, 2024 • 7

BlenderAlchemy: Editing 3D Graphics with Vision-Language Models

Paper • 2404.17672 • Published Apr 26, 2024 • 20

Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations

Paper • 2404.17521 • Published Apr 26, 2024 • 13

Automatic Creative Selection with Cross-Modal Matching

Paper • 2405.00029 • Published Feb 28, 2024 • 9

What matters when building vision-language models?

Paper • 2405.02246 • Published May 3, 2024 • 104

Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

Paper • 2405.07990 • Published May 13, 2024 • 21

No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding

Paper • 2405.08344 • Published May 14, 2024 • 16

Understanding the performance gap between online and offline alignment algorithms

Paper • 2405.08448 • Published May 14, 2024 • 20

SpeechVerse: A Large-scale Generalizable Audio Language Model

Paper • 2405.08295 • Published May 14, 2024 • 20

SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

Paper • 2405.08317 • Published May 14, 2024 • 13

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

Paper • 2405.09215 • Published May 15, 2024 • 23

LoRA Learns Less and Forgets Less

Paper • 2405.09673 • Published May 15, 2024 • 89

Many-Shot In-Context Learning in Multimodal Foundation Models

Paper • 2405.09798 • Published May 16, 2024 • 33

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Paper • 2405.09818 • Published May 16, 2024 • 131

Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

Paper • 2405.10300 • Published May 16, 2024 • 31

Toon3D: Seeing Cartoons from a New Perspective

Paper • 2405.10320 • Published May 16, 2024 • 23

Octo: An Open-Source Generalist Robot Policy

Paper • 2405.12213 • Published May 20, 2024 • 29

Imp: Highly Capable Large Multimodal Models for Mobile Devices

Paper • 2405.12107 • Published May 20, 2024 • 30

Your Transformer is Secretly Linear

Paper • 2405.12250 • Published May 19, 2024 • 159

Diffusion for World Modeling: Visual Details Matter in Atari

Paper • 2405.12399 • Published May 20, 2024 • 31

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

Paper • 2405.14129 • Published May 23, 2024 • 14

CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers

Paper • 2405.13195 • Published May 21, 2024 • 12

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24, 2024 • 56

Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

Paper • 2405.15216 • Published May 24, 2024 • 17

An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27, 2024 • 90

Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27, 2024 • 34

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Paper • 2405.17428 • Published May 27, 2024 • 20

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

Paper • 2405.15738 • Published May 24, 2024 • 47

Dense Connector for MLLMs

Paper • 2405.13800 • Published May 22, 2024 • 25

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

Paper • 2405.14598 • Published May 23, 2024 • 14

Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Paper • 2405.20204 • Published May 30, 2024 • 37

Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities

Paper • 2405.18669 • Published May 29, 2024 • 12

MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Paper • 2405.20340 • Published May 30, 2024 • 21

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Paper • 2405.21075 • Published May 31, 2024 • 25

Show, Don't Tell: Aligning Language Models with Demonstrated Feedback

Paper • 2406.00888 • Published Jun 2, 2024 • 34

Parrot: Multilingual Visual Instruction Tuning

Paper • 2406.02539 • Published Jun 4, 2024 • 39

PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

Paper • 2406.02884 • Published Jun 5, 2024 • 18

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Paper • 2406.04325 • Published Jun 6, 2024 • 76

AgentGym: Evolving Large Language Model-based Agents across Diverse Environments

Paper • 2406.04151 • Published Jun 6, 2024 • 22

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Paper • 2406.01014 • Published Jun 3, 2024 • 35

Vript: A Video Is Worth Thousands of Words

Paper • 2406.06040 • Published Jun 10, 2024 • 30

An Image is Worth 32 Tokens for Reconstruction and Generation

Paper • 2406.07550 • Published Jun 11, 2024 • 60

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Paper • 2406.06911 • Published Jun 11, 2024 • 12

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Paper • 2406.07476 • Published Jun 11, 2024 • 38

What If We Recaption Billions of Web Images with LLaMA-3?

Paper • 2406.08478 • Published Jun 12, 2024 • 42

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Paper • 2406.08407 • Published Jun 12, 2024 • 29

Needle In A Multimodal Haystack

Paper • 2406.07230 • Published Jun 11, 2024 • 55

mDPO: Conditional Preference Optimization for Multimodal Large Language Models

Paper • 2406.11839 • Published Jun 17, 2024 • 39

VideoLLM-online: Online Video Large Language Model for Streaming Video

Paper • 2406.11816 • Published Jun 17, 2024 • 25

TroL: Traversal of Layers for Large Language and Vision Models

Paper • 2406.12246 • Published Jun 18, 2024 • 36

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Paper • 2406.12275 • Published Jun 18, 2024 • 32

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Paper • 2406.12742 • Published Jun 18, 2024 • 15

Adversarial Attacks on Multimodal Agents

Paper • 2406.12814 • Published Jun 18, 2024 • 4

Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Paper • 2406.11230 • Published Jun 17, 2024 • 35

Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models

Paper • 2406.12649 • Published Jun 18, 2024 • 16

Understanding Hallucinations in Diffusion Models through Mode Interpolation

Paper • 2406.09358 • Published Jun 13, 2024 • 5

CMC-Bench: Towards a New Paradigm of Visual Signal Compression

Paper • 2406.09356 • Published Jun 13, 2024 • 5

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Paper • 2406.09406 • Published Jun 13, 2024 • 15

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Paper • 2406.09403 • Published Jun 13, 2024 • 23

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Paper • 2406.09411 • Published Jun 13, 2024 • 20

mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

Paper • 2406.08707 • Published Jun 13, 2024 • 17

EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

Paper • 2406.09162 • Published Jun 13, 2024 • 14

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Paper • 2406.08418 • Published Jun 12, 2024 • 30

GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Paper • 2406.08451 • Published Jun 12, 2024 • 26

Are We Done with MMLU?

Paper • 2406.04127 • Published Jun 6, 2024 • 39

NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing

Paper • 2406.06523 • Published Jun 10, 2024 • 53

Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

Paper • 2406.08487 • Published Jun 12, 2024 • 14

VCR: Visual Caption Restoration

Paper • 2406.06462 • Published Jun 10, 2024 • 13

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Paper • 2406.09415 • Published Jun 13, 2024 • 52

OpenVLA: An Open-Source Vision-Language-Action Model

Paper • 2406.09246 • Published Jun 13, 2024 • 40

DiTFastAttn: Attention Compression for Diffusion Transformer Models

Paper • 2406.08552 • Published Jun 12, 2024 • 26

Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

Paper • 2406.04338 • Published Jun 6, 2024 • 40

Hibou: A Family of Foundational Vision Transformers for Pathology

Paper • 2406.05074 • Published Jun 7, 2024 • 9

Make It Count: Text-to-Image Generation with an Accurate Number of Objects

Paper • 2406.10210 • Published Jun 14, 2024 • 78

XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning

Paper • 2406.08973 • Published Jun 13, 2024 • 90

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Paper • 2406.11833 • Published Jun 17, 2024 • 64

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

Paper • 2406.11831 • Published Jun 17, 2024 • 22

From Pixels to Prose: A Large Dataset of Dense Image Captions

Paper • 2406.10328 • Published Jun 14, 2024 • 18

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Paper • 2406.14544 • Published Jun 20, 2024 • 36

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

Paper • 2406.11069 • Published Jun 16, 2024 • 14

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

Paper • 2406.11271 • Published Jun 17, 2024 • 21

Task Me Anything

Paper • 2406.11775 • Published Jun 17, 2024 • 8

Unifying Multimodal Retrieval via Document Screenshot Embedding

Paper • 2406.11251 • Published Jun 17, 2024 • 10

The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing

Paper • 2406.10601 • Published Jun 15, 2024 • 70

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

Paper • 2406.14515 • Published Jun 20, 2024 • 34

Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models

Paper • 2406.14035 • Published Jun 20, 2024 • 13

ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights

Paper • 2406.14596 • Published Jun 20, 2024 • 5

Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report

Paper • 2406.11403 • Published Jun 17, 2024 • 4

VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

Paper • 2406.16338 • Published Jun 24, 2024 • 27

Long Context Transfer from Language to Vision

Paper • 2406.16852 • Published Jun 24, 2024 • 34

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Paper • 2406.16860 • Published Jun 24, 2024 • 61

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

Paper • 2406.17770 • Published Jun 25, 2024 • 19

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Paper • 2406.15704 • Published Jun 22, 2024 • 5

Octo-planner: On-device Language Model for Planner-Action Agents

Paper • 2406.18082 • Published Jun 26, 2024 • 49

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Paper • 2406.18521 • Published Jun 26, 2024 • 30

Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Paper • 2406.15334 • Published Jun 21, 2024 • 9

Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

Paper • 2406.17294 • Published Jun 25, 2024 • 11

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Paper • 2406.19389 • Published Jun 27, 2024 • 55

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Paper • 2406.18629 • Published Jun 26, 2024 • 43

MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

Paper • 2406.18790 • Published Jun 26, 2024 • 35

Simulating Classroom Education with LLM-Empowered Agents

Paper • 2406.19226 • Published Jun 27, 2024 • 32

AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

Paper • 2406.10900 • Published Jun 16, 2024 • 11

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Paper • 2406.20095 • Published Jun 28, 2024 • 18

EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Paper • 2406.20076 • Published Jun 28, 2024 • 10

Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity

Paper • 2406.17720 • Published Jun 25, 2024 • 8

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Paper • 2407.01284 • Published Jul 1, 2024 • 81

ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning

Paper • 2406.19741 • Published Jun 28, 2024 • 63

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Paper • 2407.00468 • Published Jun 29, 2024 • 37

ColPali: Efficient Document Retrieval with Vision Language Models

Paper • 2407.01449 • Published Jun 27, 2024 • 48

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Paper • 2407.00114 • Published Jun 27, 2024 • 13

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Paper • 2407.02477 • Published Jul 2, 2024 • 23

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Paper • 2407.03320 • Published Jul 3, 2024 • 96

TokenPacker: Efficient Visual Projector for Multimodal LLM

Paper • 2407.02392 • Published Jul 2, 2024 • 24

Unveiling Encoder-Free Vision-Language Models

Paper • 2406.11832 • Published Jun 17, 2024 • 55

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

Paper • 2406.08085 • Published Jun 12, 2024 • 17

Granular Privacy Control for Geolocation with Vision Language Models

Paper • 2407.04952 • Published Jul 6, 2024 • 7

ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation

Paper • 2407.06135 • Published Jul 8, 2024 • 23

Multi-Object Hallucination in Vision-Language Models

Paper • 2407.06192 • Published Jul 8, 2024 • 12

Vision language models are blind

Paper • 2407.06581 • Published Jul 9, 2024 • 83

VIMI: Grounding Video Generation through Multi-modal Instruction

Paper • 2407.06304 • Published Jul 8, 2024 • 10

Video-to-Audio Generation with Hidden Alignment

Paper • 2407.07464 • Published Jul 10, 2024 • 17

Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

Paper • 2407.03958 • Published Jul 4, 2024 • 22

Understanding Visual Feature Reliance through the Lens of Complexity

Paper • 2407.06076 • Published Jul 8, 2024 • 7

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

Paper • 2407.06723 • Published Jul 9, 2024 • 11

PaliGemma: A versatile 3B VLM for transfer

Paper • 2407.07726 • Published Jul 10, 2024 • 71

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Paper • 2407.07895 • Published Jul 10, 2024 • 43

Do Vision and Language Models Share Concepts? A Vector Space Alignment Study

Paper • 2302.06555 • Published Feb 13, 2023 • 9

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Paper • 2407.08303 • Published Jul 11, 2024 • 19

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Paper • 2407.08583 • Published Jul 11, 2024 • 13

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Paper • 2407.07053 • Published Jul 9, 2024 • 47

E5-V: Universal Embeddings with Multimodal Large Language Models

Paper • 2407.12580 • Published Jul 17, 2024 • 41

Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

Paper • 2407.12679 • Published Jul 17, 2024 • 8

AUITestAgent: Automatic Requirements Oriented GUI Function Testing

Paper • 2407.09018 • Published Jul 12, 2024 • 5

ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

Paper • 2407.11298 • Published Jul 16, 2024 • 5

NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

Paper • 2407.12366 • Published Jul 17, 2024 • 4

Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study

Paper • 2406.07057 • Published Jun 11, 2024 • 17

EVLM: An Efficient Vision-Language Model for Visual Understanding

Paper • 2407.14177 • Published Jul 19, 2024 • 45

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Paper • 2407.12594 • Published Jul 17, 2024 • 19

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Paper • 2407.15841 • Published Jul 22, 2024 • 41

VideoGameBunny: Towards vision assistants for video games

Paper • 2407.15295 • Published Jul 21, 2024 • 22

CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model

Paper • 2407.15233 • Published Jul 21, 2024 • 6

OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

Paper • 2407.16224 • Published Jul 23, 2024 • 30

MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence

Paper • 2407.16655 • Published Jul 23, 2024 • 31

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Paper • 2407.16198 • Published Jul 23, 2024 • 13

VILA^2: VILA Augmented VILA

Paper • 2407.17453 • Published Jul 24, 2024 • 42

Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning

Paper • 2407.15815 • Published Jul 22, 2024 • 14

AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents

Paper • 2407.17490 • Published Jul 3, 2024 • 32

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

Paper • 2407.18121 • Published Jul 25, 2024 • 17

VSSD: Vision Mamba with Non-Casual State Space Duality

Paper • 2407.18559 • Published Jul 26, 2024 • 20

Wolf: Captioning Everything with a World Summarization Framework

Paper • 2407.18908 • Published Jul 26, 2024 • 33

Diffusion Feedback Helps CLIP See Better

Paper • 2407.20171 • Published Jul 29, 2024 • 37

VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

Paper • 2407.19795 • Published Jul 29, 2024 • 11

Mixture of Nested Experts: Adaptive Processing of Visual Tokens

Paper • 2407.19985 • Published Jul 29, 2024 • 37

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

Paper • 2407.21770 • Published Jul 31, 2024 • 23

Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent

Paper • 2407.21646 • Published Jul 31, 2024 • 18

ShieldGemma: Generative AI Content Moderation Based on Gemma

Paper • 2407.21772 • Published Jul 31, 2024 • 14

Open-Vocabulary Audio-Visual Semantic Segmentation

Paper • 2407.21721 • Published Jul 31, 2024 • 9

SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1, 2024 • 115

OmniParser for Pure Vision Based GUI Agent

Paper • 2408.00203 • Published Aug 1, 2024 • 26

Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey

Paper • 2407.21794 • Published Jul 31, 2024 • 6

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Paper • 2408.01800 • Published Aug 3, 2024 • 83

Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

Paper • 2408.02657 • Published Aug 5, 2024 • 36

Language Model Can Listen While Speaking

Paper • 2408.02622 • Published Aug 5, 2024 • 42

ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning

Paper • 2408.02210 • Published Aug 5, 2024 • 9

Operationalizing Contextual Integrity in Privacy-Conscious Assistants

Paper • 2408.02373 • Published Aug 5, 2024 • 5

LLaVA-OneVision: Easy Visual Task Transfer

Paper • 2408.03326 • Published Aug 6, 2024 • 61

Diffusion Models as Data Mining Tools

Paper • 2408.02752 • Published Jul 20, 2024 • 14

AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

Paper • 2408.01708 • Published Aug 3, 2024 • 4

Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Paper • 2408.03615 • Published Aug 7, 2024 • 32

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

Paper • 2408.03695 • Published Aug 7, 2024 • 13

Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

Paper • 2408.03900 • Published Aug 7, 2024 • 10

Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Paper • 2408.04567 • Published Aug 8, 2024 • 27

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Paper • 2408.04594 • Published Aug 8, 2024 • 15

Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

Paper • 2408.04631 • Published Aug 8, 2024 • 10

VITA: Towards Open-Source Interactive Omni Multimodal LLM

Paper • 2408.05211 • Published Aug 9, 2024 • 50

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Paper • 2408.04840 • Published Aug 9, 2024 • 35

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

Paper • 2408.04810 • Published Aug 9, 2024 • 25

ControlNeXt: Powerful and Efficient Control for Image and Video Generation

Paper • 2408.06070 • Published Aug 12, 2024 • 54

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Paper • 2408.06327 • Published Aug 12, 2024 • 17

UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization

Paper • 2408.05939 • Published Aug 12, 2024 • 15

Imagen 3

Paper • 2408.07009 • Published Aug 13, 2024 • 62

Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models

Paper • 2408.06663 • Published Aug 13, 2024 • 16

DeepSpeak Dataset v1.0

Paper • 2408.05366 • Published Aug 9, 2024 • 14

Towards flexible perception with visual memory

Paper • 2408.08172 • Published Aug 15, 2024 • 24

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Paper • 2408.08872 • Published Aug 16, 2024 • 101

JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

Paper • 2408.08459 • Published Aug 15, 2024 • 46

D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning

Paper • 2408.08441 • Published Aug 15, 2024 • 8

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Paper • 2408.10188 • Published Aug 19, 2024 • 53

MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

Paper • 2408.11001 • Published Aug 20, 2024 • 12

Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data

Paper • 2408.10119 • Published Aug 19, 2024 • 17

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Paper • 2408.11039 • Published Aug 20, 2024 • 62

NeCo: Improving DINOv2's spatial representations in 19 GPU hours with Patch Neighbor Consistency

Paper • 2408.11054 • Published Aug 20, 2024 • 13

Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model

Paper • 2408.10764 • Published Aug 20, 2024 • 9

Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos

Paper • 2408.10998 • Published Aug 20, 2024 • 9

MambaEVT: Event Stream based Visual Object Tracking using State Space Model

Paper • 2408.10487 • Published Aug 20, 2024 • 7

FocusLLM: Scaling LLM's Context by Parallel Decoding

Paper • 2408.11745 • Published Aug 21, 2024 • 26

TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

Paper • 2408.11318 • Published Aug 21, 2024 • 57

GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Paper • 2408.11817 • Published Aug 21, 2024 • 9

FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting

Paper • 2408.11706 • Published Aug 21, 2024 • 7

TrackGo: A Flexible and Efficient Method for Controllable Video Generation

Paper • 2408.11475 • Published Aug 21, 2024 • 18

Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification

Paper • 2408.11237 • Published Aug 20, 2024 • 6

Iterative Object Count Optimization for Text-to-image Diffusion Models

Paper • 2408.11721 • Published Aug 21, 2024 • 6

Sapiens: Foundation for Human Vision Models

Paper • 2408.12569 • Published Aug 22, 2024 • 92

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Paper • 2408.12528 • Published Aug 22, 2024 • 52

Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

Paper • 2408.11878 • Published Aug 20, 2024 • 61

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Paper • 2408.12590 • Published Aug 22, 2024 • 37

Scalable Autoregressive Image Generation with Mamba

Paper • 2408.12245 • Published Aug 22, 2024 • 27

Real-Time Video Generation with Pyramid Attention Broadcast

Paper • 2408.12588 • Published Aug 22, 2024 • 17

SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models

Paper • 2408.12114 • Published Aug 22, 2024 • 14

Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation

Paper • 2408.09787 • Published Aug 19, 2024 • 10

Building and better understanding vision-language models: insights and future directions

Paper • 2408.12637 • Published Aug 22, 2024 • 131

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

Paper • 2408.13257 • Published Aug 23, 2024 • 27

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

Paper • 2408.13239 • Published Aug 23, 2024 • 12

Foundation Models for Music: A Survey

Paper • 2408.14340 • Published Aug 26, 2024 • 45

LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!

Paper • 2408.13402 • Published Aug 23, 2024 • 18

TVG: A Training-free Transition Video Generation Method with Diffusion Models

Paper • 2408.13413 • Published Aug 24, 2024 • 14

BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline

Paper • 2408.15079 • Published Aug 27, 2024 • 55

Law of Vision Representation in MLLMs

Paper • 2408.16357 • Published Aug 29, 2024 • 96

CogVLM2: Visual Language Models for Image and Video Understanding

Paper • 2408.16500 • Published Aug 29, 2024 • 58

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Paper • 2408.16532 • Published Aug 29, 2024 • 51

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Paper • 2408.16725 • Published Aug 29, 2024 • 54

VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

Paper • 2408.17253 • Published Aug 30, 2024 • 40

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Paper • 2408.09174 • Published Aug 17, 2024 • 53

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Paper • 2409.01071 • Published Sep 2, 2024 • 28

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

Paper • 2409.02095 • Published Sep 3, 2024 • 37

LinFusion: 1 GPU, 1 Minute, 16K Image

Paper • 2409.02097 • Published Sep 3, 2024 • 35

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Paper • 2409.02889 • Published Sep 4, 2024 • 55

Attention Heads of Large Language Models: A Survey

Paper • 2409.03752 • Published Sep 5, 2024 • 90

Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Paper • 2409.04410 • Published Sep 6, 2024 • 26

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Paper • 2409.05840 • Published Sep 9, 2024 • 49

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Paper • 2409.02795 • Published Sep 4, 2024 • 73

POINTS: Improving Your Vision-language Model with Affordable Strategies

Paper • 2409.04828 • Published Sep 7, 2024 • 25

Benchmarking Chinese Knowledge Rectification in Large Language Models

Paper • 2409.05806 • Published Sep 9, 2024 • 15

LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Paper • 2409.06666 • Published Sep 10, 2024 • 58

Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Paper • 2409.06135 • Published Sep 10, 2024 • 16

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

Paper • 2409.06820 • Published Sep 10, 2024 • 70

MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis

Paper • 2409.07129 • Published Sep 11, 2024 • 8

PiTe: Pixel-Temporal Alignment for Large Video-Language Model

Paper • 2409.07239 • Published Sep 11, 2024 • 15

Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models

Paper • 2409.06277 • Published Sep 10, 2024 • 16

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Paper • 2409.09269 • Published Sep 14, 2024 • 9

One missing piece in Vision and Language: A Survey on Comics Understanding

Paper • 2409.09502 • Published Sep 14, 2024 • 26

NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17, 2024 • 75

OmniGen: Unified Image Generation

Paper • 2409.11340 • Published Sep 17, 2024 • 115

Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

Paper • 2409.11355 • Published Sep 17, 2024 • 31

OSV: One Step is Enough for High-Quality Image to Video Generation

Paper • 2409.11367 • Published Sep 17, 2024 • 14

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

Paper • 2409.03420 • Published Sep 5, 2024 • 27

InstantDrag: Improving Interactivity in Drag-based Image Editing

Paper • 2409.08857 • Published Sep 13, 2024 • 35

AudioBERT: Audio Knowledge Augmented Language Model

Paper • 2409.08199 • Published Sep 12, 2024 • 5

LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study

Paper • 2409.08554 • Published Sep 13, 2024 • 2

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Paper • 2409.12191 • Published Sep 18, 2024 • 78

Qwen2.5-Coder Technical Report

Paper • 2409.12186 • Published Sep 18, 2024 • 147

Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey

Paper • 2409.11564 • Published Sep 17, 2024 • 21

Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

Paper • 2409.12139 • Published Sep 18, 2024 • 12

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Paper • 2409.12961 • Published Sep 19, 2024 • 26

StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation

Paper • 2409.12576 • Published Sep 19, 2024 • 16

Imagine yourself: Tuning-Free Personalized Image Generation

Paper • 2409.13346 • Published Sep 20, 2024 • 71

YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models

Paper • 2409.13592 • Published Sep 20, 2024 • 52

Portrait Video Editing Empowered by Multimodal Generative Priors

Paper • 2409.13591 • Published Sep 20, 2024 • 17

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Paper • 2409.15278 • Published Sep 23, 2024 • 26

Phantom of Latent for Large Language and Vision Models

Paper • 2409.14713 • Published Sep 23, 2024 • 30

Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections

Paper • 2409.14677 • Published Sep 23, 2024 • 16

MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling

Paper • 2409.16160 • Published Sep 24, 2024 • 34

MonoFormer: One Transformer for Both Diffusion and Autoregression

Paper • 2409.16280 • Published Sep 24, 2024 • 18

Seeing Faces in Things: A Model and Dataset for Pareidolia

Paper • 2409.16143 • Published Sep 24, 2024 • 17

Attention Prompting on Image for Large Vision-Language Models

Paper • 2409.17143 • Published Sep 25, 2024 • 7

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 114

MIO: A Foundation Model on Multimodal Tokens

Paper • 2409.17692 • Published Sep 26, 2024 • 54

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Paper • 2409.20566 • Published Sep 30, 2024 • 57

Visual Question Decomposition on Multimodal Large Language Models

Paper • 2409.19339 • Published Sep 28, 2024 • 9

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Paper • 2410.02757 • Published Oct 3, 2024 • 38

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Paper • 2410.02740 • Published Oct 3, 2024 • 55

LLaVA-Critic: Learning to Evaluate Multimodal Models

Paper • 2410.02712 • Published Oct 3, 2024 • 37

Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations

Paper • 2410.02762 • Published Oct 3, 2024 • 9

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

Paper • 2410.02763 • Published Oct 3, 2024 • 7

Addition is All You Need for Energy-efficient Language Models

Paper • 2410.00907 • Published Oct 1, 2024 • 151

VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

Paper • 2410.04364 • Published Oct 6, 2024 • 30

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Paper • 2410.05243 • Published Oct 7, 2024 • 19

UniMuMo: Unified Text, Music and Motion Generation

Paper • 2410.04534 • Published Oct 6, 2024 • 19

TLDR: Token-Level Detective Reward Model for Large Vision Language Models

Paper • 2410.04734 • Published Oct 7, 2024 • 17

OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction

Paper • 2410.04932 • Published Oct 7, 2024 • 9

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Paper • 2410.01912 • Published Oct 2, 2024 • 14

ControlAR: Controllable Image Generation with Autoregressive Models

Paper • 2410.02705 • Published Oct 3, 2024 • 11

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Paper • 2410.03290 • Published Oct 4, 2024 • 7

Aria: An Open Multimodal Native Mixture-of-Experts Model

Paper • 2410.05993 • Published Oct 8, 2024 • 112

Personalized Visual Instruction Tuning

Paper • 2410.07113 • Published Oct 9, 2024 • 71

Pixtral 12B

Paper • 2410.07073 • Published Oct 9, 2024 • 66

IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

Paper • 2410.07171 • Published Oct 9, 2024 • 44

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

Paper • 2410.07167 • Published Oct 9, 2024 • 40

Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning

Paper • 2410.06373 • Published Oct 8, 2024 • 34

Pyramidal Flow Matching for Efficient Video Generative Modeling

Paper • 2410.05954 • Published Oct 8, 2024 • 40

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Paper • 2410.05363 • Published Oct 7, 2024 • 46

Story-Adapter: A Training-free Iterative Framework for Long Story Visualization

Paper • 2410.06244 • Published Oct 8, 2024 • 18

MM-Ego: Towards Building Egocentric Multimodal LLMs

Paper • 2410.07177 • Published Oct 9, 2024 • 22

TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation

Paper • 2410.05591 • Published Oct 8, 2024 • 13

Temporal Reasoning Transfer from Text to Video

Paper • 2410.06166 • Published Oct 8, 2024 • 13

MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

Paper • 2410.03450 • Published Oct 4, 2024 • 37

Intriguing Properties of Large Language and Vision Models

Paper • 2410.04751 • Published Oct 7, 2024 • 16

Progressive Autoregressive Video Diffusion Models

Paper • 2410.08151 • Published Oct 10, 2024 • 16

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Paper • 2410.05210 • Published Oct 7, 2024 • 11

Self-Boosting Large Language Models with Synthetic Preference Data

Paper • 2410.06961 • Published Oct 9, 2024 • 17

WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents

Paper • 2410.07484 • Published Oct 9, 2024 • 51

Agent S: An Open Agentic Framework that Uses Computers Like a Human

Paper • 2410.08164 • Published Oct 10, 2024 • 25

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Paper • 2410.06154 • Published Oct 8, 2024 • 16

Baichuan-Omni Technical Report

Paper • 2410.08565 • Published Oct 11, 2024 • 88

From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

Paper • 2410.06456 • Published Oct 9, 2024 • 38

EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models

Paper • 2410.07133 • Published Oct 9, 2024 • 19

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Paper • 2410.10139 • Published Oct 14, 2024 • 53

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Paper • 2410.10594 • Published Oct 14, 2024 • 28

MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation

Paper • 2410.11779 • Published Oct 15, 2024 • 27

LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

Paper • 2410.10816 • Published Oct 14, 2024 • 21

Improving Long-Text Alignment for Text-to-Image Diffusion Models

Paper • 2410.11817 • Published Oct 15, 2024 • 15

OMCAT: Omni Context Aware Transformer

Paper • 2410.12109 • Published Oct 15, 2024 • 4

VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI

Paper • 2410.11623 • Published Oct 15, 2024 • 49

HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks

Paper • 2410.12381 • Published Oct 16, 2024 • 45

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Paper • 2410.12787 • Published Oct 16, 2024 • 32

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Paper • 2410.13848 • Published Oct 17, 2024 • 35

Harnessing Webpage UIs for Text-Rich Visual Understanding

Paper • 2410.13824 • Published Oct 17, 2024 • 32

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

Paper • 2410.12705 • Published Oct 16, 2024 • 33

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Paper • 2410.13863 • Published Oct 17, 2024 • 38

MobA: A Two-Level Agent System for Efficient Mobile Task Automation

Paper • 2410.13757 • Published Oct 17, 2024 • 33

Roadmap towards Superhuman Speech Understanding using Large Language Models

Paper • 2410.13268 • Published Oct 17, 2024 • 35

Movie Gen: A Cast of Media Foundation Models

Paper • 2410.13720 • Published Oct 17, 2024 • 99

DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

Paper • 2410.13830 • Published Oct 17, 2024 • 25

MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

Paper • 2410.13085 • Published Oct 16, 2024 • 23

A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

Paper • 2410.13639 • Published Oct 17, 2024 • 19

VidPanos: Generative Panoramic Videos from Casual Panning Videos

Paper • 2410.13832 • Published Oct 17, 2024 • 13

Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

Paper • 2410.13360 • Published Oct 17, 2024 • 9

γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models

Paper • 2410.13859 • Published Oct 17, 2024 • 8

Can MLLMs Understand the Deep Implication Behind Chinese Images?

Paper • 2410.13854 • Published Oct 17, 2024 • 11

FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model

Paper • 2410.13925 • Published Oct 17, 2024 • 24

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Paper • 2410.11190 • Published Oct 15, 2024 • 22

SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation

Paper • 2410.14745 • Published Oct 17, 2024 • 48

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Paper • 2410.16268 • Published Oct 21, 2024 • 69

Baichuan Alignment Technical Report

Paper • 2410.14940 • Published Oct 19, 2024 • 52

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Paper • 2410.13861 • Published Oct 17, 2024 • 57

Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

Paper • 2410.09347 • Published Oct 12, 2024 • 5

AutoTrain: No-code training for state-of-the-art models

Paper • 2410.15735 • Published Oct 21, 2024 • 60

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Paper • 2410.16184 • Published Oct 21, 2024 • 24

Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

Paper • 2410.15316 • Published Oct 20, 2024 • 12

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Paper • 2410.17247 • Published Oct 22, 2024 • 48

Aligning Large Language Models via Self-Steering Optimization

Paper • 2410.17131 • Published Oct 22, 2024 • 23

Improve Vision Language Model Chain-of-thought Reasoning

Paper • 2410.16198 • Published Oct 21, 2024 • 27

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Paper • 2410.16267 • Published Oct 21, 2024 • 18

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

Paper • 2410.17637 • Published Oct 23, 2024 • 37

Can Knowledge Editing Really Correct Hallucinations?

Paper • 2410.16251 • Published Oct 21, 2024 • 56

LOGO -- Long cOntext aliGnment via efficient preference Optimization

Paper • 2410.18533 • Published Oct 24, 2024 • 44

Distill Visual Chart Reasoning Ability from LLMs to MLLMs

Paper • 2410.18798 • Published Oct 24, 2024 • 21

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Paper • 2410.18558 • Published Oct 24, 2024 • 20

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

Paper • 2410.17779 • Published Oct 23, 2024 • 9

ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting

Paper • 2410.17856 • Published Oct 23, 2024 • 52

Continuous Speech Synthesis using per-token Latent Diffusion

Paper • 2410.16048 • Published Oct 21, 2024 • 30

GPT-4o System Card

Paper • 2410.21276 • Published Oct 25, 2024 • 85

Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Paper • 2410.21220 • Published Oct 28, 2024 • 10

CLEAR: Character Unlearning in Textual and Visual Modalities

Paper • 2410.18057 • Published Oct 23, 2024 • 210

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Paper • 2410.22587 • Published Oct 29, 2024 • 10

ReferEverything: Towards Segmenting Everything We Can Speak of in Videos

Paper • 2410.23287 • Published Oct 30, 2024 • 19

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Paper • 2410.23218 • Published Oct 30, 2024 • 51

Personalization of Large Language Models: A Survey

Paper • 2411.00027 • Published Oct 29, 2024 • 35

Randomized Autoregressive Visual Generation

Paper • 2411.00776 • Published Nov 1, 2024 • 18

Face Anonymization Made Simple

Paper • 2411.00762 • Published Nov 1, 2024 • 7

AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents

Paper • 2410.24024 • Published Oct 31, 2024 • 51

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Paper • 2411.02337 • Published Nov 4, 2024 • 38

How Far is Video Generation from World Model: A Physical Law Perspective

Paper • 2411.02385 • Published Nov 4, 2024 • 36

Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Paper • 2411.02265 • Published Nov 4, 2024 • 25

Adaptive Caching for Faster Video Generation with Diffusion Transformers

Paper • 2411.02397 • Published Nov 4, 2024 • 24

AutoVFX: Physically Realistic Video Editing from Natural Language Instructions

Paper • 2411.02394 • Published Nov 4, 2024 • 17

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Paper • 2411.02359 • Published Nov 4, 2024 • 13

Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination

Paper • 2411.03823 • Published Nov 6, 2024 • 50

Adaptive Length Image Tokenization via Recurrent Allocation

Paper • 2411.02393 • Published Nov 4, 2024 • 13

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

Paper • 2411.05003 • Published Nov 7, 2024 • 72

TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

Paper • 2411.04709 • Published Nov 5, 2024 • 27

M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding

Paper • 2411.04952 • Published Nov 7, 2024 • 30

Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

Paper • 2411.05000 • Published Nov 7, 2024 • 23

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Paper • 2411.04923 • Published Nov 7, 2024 • 23

Analyzing The Language of Visual Tokens

Paper • 2411.05001 • Published Nov 7, 2024 • 25

LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation

Paper • 2411.04997 • Published Nov 7, 2024 • 40

RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models

Paper • 2411.04097 • Published Nov 6, 2024 • 5

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Paper • 2411.07199 • Published Nov 11, 2024 • 50

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

Paper • 2411.07140 • Published Nov 11, 2024 • 36

Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models

Paper • 2411.07126 • Published Nov 11, 2024 • 31

Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models

Paper • 2411.07232 • Published Nov 11, 2024 • 67

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

Paper • 2411.07975 • Published Nov 12, 2024 • 31

Autoregressive Models in Vision: A Survey

Paper • 2411.05902 • Published Nov 8, 2024 • 18

MagicQuill: An Intelligent Interactive Image Editing System

Paper • 2411.09703 • Published Nov 14, 2024 • 78

Sharingan: Extract User Action Sequence from Desktop Recordings

Paper • 2411.08768 • Published Nov 13, 2024 • 10

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Paper • 2411.10440 • Published Nov 15, 2024 • 125

Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement

Paper • 2411.06558 • Published Nov 10, 2024 • 37

The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use

Paper • 2411.10323 • Published Nov 15, 2024 • 35

Number it: Temporal Grounding Videos like Flipping Manga

Paper • 2411.10332 • Published Nov 15, 2024 • 14

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Paper • 2411.10640 • Published Nov 16, 2024 • 47

Generative World Explorer

Paper • 2411.11844 • Published Nov 18, 2024 • 78

AnimateAnything: Consistent and Controllable Animation for Video Generation

Paper • 2411.10836 • Published Nov 16, 2024 • 25

SlimLM: An Efficient Small Language Model for On-Device Document Assistance

Paper • 2411.09944 • Published Nov 15, 2024 • 12

Adaptive Decoding via Latent Preference Optimization

Paper • 2411.09661 • Published Nov 14, 2024 • 10

StableV2V: Stablizing Shape Consistency in Video-to-Video Editing

Paper • 2411.11045 • Published Nov 17, 2024 • 11

RedPajama: an Open Dataset for Training Large Language Models

Paper • 2411.12372 • Published Nov 19, 2024 • 56

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization

Paper • 2411.11909 • Published Nov 17, 2024 • 23

FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations

Paper • 2411.10818 • Published Nov 16, 2024 • 27

ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

Paper • 2411.12044 • Published Nov 18, 2024 • 14

Continuous Speculative Decoding for Autoregressive Image Generation

Paper • 2411.11925 • Published Nov 18, 2024 • 16

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Paper • 2411.10442 • Published Nov 15, 2024 • 80

Multimodal Autoregressive Pre-training of Large Vision Encoders

Paper • 2411.14402 • Published Nov 21, 2024 • 47

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Paper • 2411.14432 • Published Nov 21, 2024 • 26

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

Paper • 2411.14982 • Published Nov 22, 2024 • 17

O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?

Paper • 2411.16489 • Published Nov 25, 2024 • 49

One Diffusion to Generate Them All

Paper • 2411.16318 • Published Nov 25, 2024 • 31

DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

Paper • 2411.16657 • Published Nov 25, 2024 • 19

Factorized Visual Tokenization and Generation

Paper • 2411.16681 • Published Nov 25, 2024 • 19

TEXGen: a Generative Diffusion Model for Mesh Textures

Paper • 2411.14740 • Published Nov 22, 2024 • 17

ROICtrl: Boosting Instance Control for Visual Generation

Paper • 2411.17949 • Published Nov 27, 2024 • 88

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Paper • 2411.17465 • Published Nov 26, 2024 • 88

SketchAgent: Language-Driven Sequential Sketch Generation

Paper • 2411.17673 • Published Nov 26, 2024 • 19

Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

Paper • 2411.17686 • Published Nov 26, 2024 • 21

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Paper • 2411.15296 • Published Nov 22, 2024 • 22

Large Language Model-Brained GUI Agents: A Survey

Paper • 2411.18279 • Published Nov 27, 2024 • 32

VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Paper • 2411.17991 • Published Nov 27, 2024 • 5

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

Paper • 2411.18203 • Published Nov 27, 2024 • 38

On Domain-Specific Post-Training for Multimodal Large Language Models

Paper • 2411.19930 • Published Nov 29, 2024 • 29

Yi-Lightning Technical Report

Paper • 2412.01253 • Published Dec 2, 2024 • 29

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

Paper • 2412.01824 • Published Dec 2, 2024 • 66

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Paper • 2412.00927 • Published Dec 1, 2024 • 28

Open-Sora Plan: Open-Source Large Video Generation Model

Paper • 2412.00131 • Published Nov 28, 2024 • 35

SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

Paper • 2412.00174 • Published Nov 29, 2024 • 23

VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

Paper • 2412.00947 • Published Dec 1, 2024 • 8

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

Paper • 2412.02611 • Published Dec 3, 2024 • 24

PaliGemma 2: A Family of Versatile VLMs for Transfer

Paper • 2412.03555 • Published Dec 4, 2024 • 134

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Paper • 2412.03069 • Published Dec 4, 2024 • 35

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Paper • 2412.00493 • Published Nov 30, 2024 • 17

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

Paper • 2412.03565 • Published Dec 4, 2024 • 11

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Paper • 2412.04467 • Published Dec 5, 2024 • 111

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Paper • 2412.04424 • Published Dec 5, 2024 • 64

NVILA: Efficient Frontier Visual Language Models

Paper • 2412.04468 • Published Dec 5, 2024 • 60

Negative Token Merging: Image-based Adversarial Feature Guidance

Paper • 2412.01339 • Published Dec 2, 2024 • 23

Personalized Multimodal Large Language Models: A Survey

Paper • 2412.02142 • Published Dec 3, 2024 • 14

OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

Paper • 2412.01169 • Published Dec 2, 2024 • 13

p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Paper • 2412.04449 • Published Dec 5, 2024 • 7

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Paper • 2412.03704 • Published Dec 4, 2024 • 7

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 158

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Paper • 2412.05237 • Published Dec 6, 2024 • 48

LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment

Paper • 2412.04814 • Published Dec 6, 2024 • 49

SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion

Paper • 2412.04301 • Published Dec 5, 2024 • 41

CompCap: Improving Multimodal Large Language Models with Composite Captions

Paper • 2412.05243 • Published Dec 6, 2024 • 19

Mind the Time: Temporally-Controlled Multi-Event Video Generation

Paper • 2412.05263 • Published Dec 6, 2024 • 11

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Paper • 2412.04626 • Published Dec 5, 2024 • 14

Training Large Language Models to Reason in a Continuous Latent Space

Paper • 2412.06769 • Published Dec 9, 2024 • 86

Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Paper • 2412.06781 • Published Dec 9, 2024 • 21

Maya: An Instruction Finetuned Multilingual Multimodal Model

Paper • 2412.07112 • Published Dec 10, 2024 • 29

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

Paper • 2412.04432 • Published Dec 5, 2024 • 16

Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

Paper • 2412.05939 • Published Dec 8, 2024 • 16

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

Paper • 2412.07589 • Published Dec 10, 2024 • 49

Perception Tokens Enhance Visual Reasoning in Multimodal Language Models

Paper • 2412.03548 • Published Dec 4, 2024 • 17

POINTS1.5: Building a Vision-Language Model towards Real World Applications

Paper • 2412.08443 • Published Dec 11, 2024 • 39

LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations

Paper • 2412.08580 • Published Dec 11, 2024 • 46

MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation

Paper • 2412.07147 • Published Dec 10, 2024 • 5

StreamChat: Chatting with Streaming Video

Paper • 2412.08646 • Published Dec 11, 2024 • 18

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Paper • 2412.09596 • Published Dec 12, 2024 • 99

Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

Paper • 2412.08737 • Published Dec 11, 2024 • 54

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Paper • 2412.09501 • Published Dec 12, 2024 • 49

Multimodal Latent Language Modeling with Next-Token Diffusion

Paper • 2412.08635 • Published Dec 11, 2024 • 46

EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

Paper • 2412.09618 • Published Dec 12, 2024 • 21

VisionArena: 230K Real World User-VLM Conversations with Preference Labels

Paper • 2412.08687 • Published Dec 11, 2024 • 13

Arbitrary-steps Image Super-resolution via Diffusion Inversion

Paper • 2412.09013 • Published Dec 12, 2024 • 13

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Paper • 2412.10360 • Published Dec 13, 2024 • 146

GenEx: Generating an Explorable World

Paper • 2412.09624 • Published Dec 12, 2024 • 97

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Paper • 2412.09283 • Published Dec 12, 2024 • 19

Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

Paper • 2412.09428 • Published Dec 12, 2024 • 7

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Paper • 2412.09604 • Published Dec 12, 2024 • 38

Byte Latent Transformer: Patches Scale Better Than Tokens

Paper • 2412.09871 • Published Dec 13, 2024 • 103

BrushEdit: All-In-One Image Inpainting and Editing

Paper • 2412.10316 • Published Dec 13, 2024 • 36

VidTok: A Versatile and Open-Source Video Tokenizer

Paper • 2412.13061 • Published Dec 17, 2024 • 8

GUI Agents: A Survey

Paper • 2412.13501 • Published Dec 18, 2024 • 29

Progressive Multimodal Reasoning via Active Retrieval

Paper • 2412.14835 • Published Dec 19, 2024 • 74

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Paper • 2412.14475 • Published Dec 19, 2024 • 55

Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

Paper • 2412.14233 • Published Dec 18, 2024 • 6

Large Motion Video Autoencoding with Cross-modal Video VAE

Paper • 2412.17805 • Published Dec 23, 2024 • 24

Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding

Paper • 2412.17295 • Published Dec 23, 2024 • 9

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Paper • 2412.15213 • Published Dec 19, 2024 • 29

Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion

Paper • 2412.14462 • Published Dec 19, 2024 • 15

AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Paper • 2412.15191 • Published Dec 19, 2024 • 5

Parallelized Autoregressive Visual Generation

Paper • 2412.15119 • Published Dec 19, 2024 • 54

Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Paper • 2412.15322 • Published Dec 19, 2024 • 18

Sequence Matters: Harnessing Video Models in 3D Super-Resolution

Paper • 2412.11525 • Published Dec 16, 2024 • 11

Diving into Self-Evolving Training for Multimodal Reasoning

Paper • 2412.17451 • Published Dec 23, 2024 • 44

Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching

Paper • 2412.17153 • Published Dec 22, 2024 • 39

NILE: Internal Consistency Alignment in Large Language Models

Paper • 2412.16686 • Published Dec 21, 2024 • 8

DepthLab: From Partial to Complete

Paper • 2412.18153 • Published Dec 24, 2024 • 37

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Paper • 2412.18450 • Published Dec 24, 2024 • 37

Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

Paper • 2412.17739 • Published Dec 23, 2024 • 42

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Paper • 2412.18597 • Published Dec 24, 2024 • 19

How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?

Paper • 2412.18495 • Published Dec 24, 2024 • 9

Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

Paper • 2412.18609 • Published Dec 24, 2024 • 18

Bridging the Data Provenance Gap Across Text, Speech and Video

Paper • 2412.17847 • Published Dec 19, 2024 • 9

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Paper • 2412.18319 • Published Dec 24, 2024 • 40

YuLan-Mini: An Open Data-efficient Language Model

Paper • 2412.17743 • Published Dec 23, 2024 • 67

MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Paper • 2412.18072 • Published Dec 24, 2024 • 20

Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation

Paper • 2412.18176 • Published Dec 24, 2024 • 16

1.58-bit FLUX

Paper • 2412.18653 • Published Dec 24, 2024 • 84

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Paper • 2412.18619 • Published Dec 16, 2024 • 59

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Paper • 2412.19326 • Published Dec 26, 2024 • 18

Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

Paper • 2412.19512 • Published Dec 27, 2024 • 8

Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Paper • 2412.18525 • Published Dec 24, 2024 • 76

Edicho: Consistent Image Editing in the Wild

Paper • 2412.21079 • Published Dec 30, 2024 • 23

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

Paper • 2412.21037 • Published Dec 30, 2024 • 24

Are Vision-Language Models Truly Understanding Multi-vision Sensor?

Paper • 2412.20750 • Published Dec 30, 2024 • 20

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Paper • 2501.00958 • Published Jan 1 • 107

VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

Paper • 2501.01427 • Published Jan 2 • 55

LTX-Video: Realtime Video Latent Diffusion

Paper • 2501.00103 • Published Dec 30, 2024 • 47

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Paper • 2501.00599 • Published Dec 31, 2024 • 48

MLLM-as-a-Judge for Image Safety without Human Labeling

Paper • 2501.00192 • Published Dec 31, 2024 • 31

A3: Android Agent Arena for Mobile GUI Agents

Paper • 2501.01149 • Published Jan 2 • 22

Unifying Specialized Visual Encoders for Video Language Models

Paper • 2501.01426 • Published Jan 2 • 21

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Paper • 2501.01957 • Published Jan 3 • 47

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Paper • 2501.03895 • Published Jan 7 • 53

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Paper • 2501.02955 • Published Jan 6 • 45

Cosmos World Foundation Model Platform for Physical AI

Paper • 2501.03575 • Published Jan 7 • 78

REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

Paper • 2501.03262 • Published Jan 4 • 99

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7 • 46

OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis

Paper • 2501.04561 • Published Jan 8 • 16

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

Paper • 2501.04575 • Published Jan 8 • 24

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Paper • 2501.05366 • Published Jan 9 • 101

DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization

Paper • 2501.03271 • Published Jan 5 • 11

The GAN is dead; long live the GAN! A Modern GAN Baseline

Paper • 2501.05441 • Published Jan 9 • 92

Enhancing Human-Like Responses in Large Language Models

Paper • 2501.05032 • Published Jan 9 • 57

An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published Jan 9 • 42

Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model

Paper • 2501.05122 • Published Jan 9 • 20

On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis

Paper • 2501.04377 • Published Jan 8 • 14

VideoRAG: Retrieval-Augmented Generation over Video Corpus

Paper • 2501.05874 • Published Jan 10 • 72

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Paper • 2501.06186 • Published Jan 10 • 66

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

Paper • 2501.05767 • Published Jan 10 • 30

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Paper • 2501.05510 • Published Jan 9 • 44

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Paper • 2501.06282 • Published Jan 10 • 52

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Paper • 2501.08326 • Published Jan 14 • 35

MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training

Paper • 2501.07556 • Published Jan 13 • 6

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Paper • 2501.08828 • Published Jan 15 • 32

RepVideo: Rethinking Cross-Layer Representation for Video Generation

Paper • 2501.08994 • Published Jan 15 • 15

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

Paper • 2501.05452 • Published Jan 9 • 15

Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

Paper • 2501.05707 • Published Jan 10 • 20

VideoAuteur: Towards Long Narrative Video Generation

Paper • 2501.06173 • Published Jan 10 • 34

SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

Paper • 2501.06842 • Published Jan 12 • 16

Evaluating Sample Utility for Data Selection by Mimicking Model Weights

Paper • 2501.06708 • Published Jan 12 • 5

MiniMax-01: Scaling Foundation Models with Lightning Attention

Paper • 2501.08313 • Published Jan 14 • 293

Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens

Paper • 2501.07730 • Published Jan 13 • 17

HALoGEN: Fantastic LLM Hallucinations and Where to Find Them

Paper • 2501.08292 • Published Jan 14 • 17

Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding

Paper • 2501.07888 • Published Jan 14 • 15

OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training

Paper • 2501.08197 • Published Jan 14 • 8

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Paper • 2501.07783 • Published Jan 14 • 7

MINIMA: Modality Invariant Image Matching

Paper • 2412.19412 • Published Dec 27, 2024 • 4

OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

Paper • 2501.09751 • Published Jan 16 • 49

Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Paper • 2501.09755 • Published Jan 16 • 37

Do generative video models learn physical principles from watching videos?

Paper • 2501.09038 • Published Jan 14 • 35

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Paper • 2501.09747 • Published Jan 16 • 24

VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

Paper • 2501.09781 • Published Jan 16 • 29

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Paper • 2501.12380 • Published Jan 21 • 86

Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

Paper • 2501.11733 • Published Jan 20 • 29

Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

Paper • 2501.13926 • Published Jan 23 • 42

Baichuan-Omni-1.5 Technical Report

Paper • 2501.15368 • Published Jan 26 • 64

Qwen2.5-1M Technical Report

Paper • 2501.15383 • Published Jan 26 • 71

Towards General-Purpose Model-Free Reinforcement Learning

Paper • 2501.16142 • Published Jan 27 • 30

Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Paper • 2501.15907 • Published Jan 27 • 17

Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

Paper • 2403.09193 • Published Mar 14, 2024 • 9

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Paper • 2501.17161 • Published Jan 28 • 122

PixelWorld: Towards Perceiving Everything as Pixels

Paper • 2501.19339 • Published Jan 31 • 17

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

Paper • 2502.01061 • Published Feb 3 • 216

Process Reinforcement through Implicit Rewards

Paper • 2502.01456 • Published Feb 3 • 62

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Paper • 2502.01341 • Published Feb 3 • 39

Humanity's Last Exam

Paper • 2501.14249 • Published Jan 24 • 75

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Paper • 2501.13106 • Published Jan 22 • 91

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Paper • 2501.12599 • Published Jan 22 • 117

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Paper • 2501.12895 • Published Jan 22 • 63

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Paper • 2501.12948 • Published Jan 22 • 399

Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning

Paper • 2502.03275 • Published Feb 5 • 18

Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Paper • 2502.03032 • Published Feb 5 • 61

Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment

Paper • 2502.04328 • Published Feb 6 • 30

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Paper • 2502.05173 • Published Feb 7 • 65

Fast Video Generation with Sliding Tile Attention

Paper • 2502.04507 • Published Feb 6 • 52

Goku: Flow Based Video Generative Foundation Models

Paper • 2502.04896 • Published Feb 7 • 104

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Paper • 2502.05171 • Published Feb 7 • 142

QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

Paper • 2502.05178 • Published Feb 7 • 10

On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices

Paper • 2502.04363 • Published Feb 5 • 12

Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

Paper • 2502.06703 • Published Feb 10 • 153

Scaling Pre-training to One Hundred Billion Data for Vision Language Models

Paper • 2502.07617 • Published Feb 11 • 29

Expect the Unexpected: FailSafe Long Context QA for Finance

Paper • 2502.06329 • Published Feb 10 • 132

Magic 1-For-1: Generating One Minute Video Clips within One Minute

Paper • 2502.07701 • Published Feb 11 • 36

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

Paper • 2502.08590 • Published Feb 12 • 44

TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation

Paper • 2502.07870 • Published Feb 11 • 45

WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

Paper • 2502.08047 • Published Feb 12 • 28

TransMLA: Multi-head Latent Attention Is All You Need

Paper • 2502.07864 • Published Feb 11 • 54

mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data

Paper • 2502.08468 • Published Feb 12 • 14

The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

Paper • 2502.08946 • Published Feb 13 • 194

Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

Paper • 2502.08690 • Published Feb 12 • 44

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Paper • 2502.09560 • Published Feb 13 • 36

ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

Paper • 2502.09696 • Published Feb 13 • 44

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

Paper • 2502.10248 • Published Feb 14 • 56

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Paper • 2502.10391 • Published Feb 14 • 35

Large Language Diffusion Models

Paper • 2502.09992 • Published Feb 14 • 119

Learning Getting-Up Policies for Real-World Humanoid Robots

Paper • 2502.12152 • Published Feb 17 • 42

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Paper • 2502.11089 • Published Feb 16 • 159

How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training

Paper • 2502.11196 • Published Feb 16 • 22

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

Paper • 2502.10458 • Published Feb 12 • 36

HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

Paper • 2502.12148 • Published Feb 17 • 16

Intuitive physics understanding emerges from self-supervised pretraining on natural videos

Paper • 2502.11831 • Published Feb 17 • 19

video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model

Paper • 2502.11775 • Published Feb 17 • 8

Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

Paper • 2502.08826 • Published Feb 12 • 17

ILIAS: Instance-Level Image retrieval At Scale

Paper • 2502.11748 • Published Feb 17 • 4

Soundwave: Less is More for Speech-Text Alignment in LLMs

Paper • 2502.12900 • Published Feb 18 • 86

Continuous Diffusion Model for Language Modeling

Paper • 2502.11564 • Published Feb 17 • 54

Phantom: Subject-consistent video generation via cross-modal alignment

Paper • 2502.11079 • Published Feb 16 • 60

Magma: A Foundation Model for Multimodal AI Agents

Paper • 2502.13130 • Published Feb 18 • 58

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

Paper • 2502.12115 • Published Feb 17 • 45

Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

Paper • 2502.13145 • Published Feb 18 • 38

RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

Paper • 2502.12513 • Published Feb 18 • 16

Harnessing Vision Models for Time Series Analysis: A Survey

Paper • 2502.08869 • Published Feb 13 • 2

Qwen2.5-VL Technical Report

Paper • 2502.13923 • Published Feb 19 • 189

On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective

Paper • 2502.14296 • Published Feb 20 • 46

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Paper • 2502.14786 • Published Feb 20 • 144

How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?

Paper • 2502.14502 • Published Feb 20 • 91

LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

Paper • 2502.14834 • Published Feb 20 • 24

Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information

Paper • 2502.14258 • Published Feb 20 • 26

PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC

Paper • 2502.14282 • Published Feb 20 • 20

How to Get Your LLM to Generate Challenging Problems for Evaluation

Paper • 2502.14678 • Published Feb 20 • 18

Dynamic Concepts Personalization from Single Videos

Paper • 2502.14844 • Published Feb 20 • 16

Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Paper • 2502.14846 • Published Feb 20 • 13

NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization

Paper • 2502.14638 • Published Feb 20 • 11

From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Paper • 2502.14802 • Published Feb 20 • 13

Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

Paper • 2502.13063 • Published Feb 18 • 72

VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues

Paper • 2502.12084 • Published Feb 17 • 30

LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers

Paper • 2502.15007 • Published Feb 20 • 175

SurveyX: Academic Survey Automation via Large Language Models

Paper • 2502.14776 • Published Feb 20 • 100

PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data

Paper • 2502.14397 • Published Feb 20 • 42

DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

Paper • 2502.17157 • Published Feb 24 • 53

Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models

Paper • 2502.16033 • Published Feb 22 • 18

Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation

Paper • 2502.16707 • Published Feb 23 • 13

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

Paper • 2502.18411 • Published Feb 25 • 73

SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference

Paper • 2502.18137 • Published Feb 25 • 57

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

Paper • 2502.18449 • Published Feb 25 • 74

KV-Edit: Training-Free Image Editing for Precise Background Preservation

Paper • 2502.17363 • Published Feb 24 • 37

ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation

Paper • 2502.18364 • Published Feb 25 • 36

K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs

Paper • 2502.18461 • Published Feb 25 • 15

Introducing Visual Perception Token into Multimodal Large Language Model

Paper • 2502.17425 • Published Feb 24 • 15

MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

Paper • 2502.17422 • Published Feb 24 • 7

LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation

Paper • 2502.18302 • Published Feb 25 • 5

Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI

Paper • 2502.17092 • Published Feb 24 • 3

TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding

Paper • 2502.19400 • Published Feb 26 • 49

Towards an AI co-scientist

Paper • 2502.18864 • Published Feb 26 • 50

MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning

Paper • 2502.19634 • Published Feb 26 • 63

UniTok: A Unified Tokenizer for Visual Generation and Understanding

Paper • 2502.20321 • Published Feb 27 • 30

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

Paper • 2502.20172 • Published Feb 27 • 28

HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models

Paper • 2502.20811 • Published Feb 28 • 2

Chain of Draft: Thinking Faster by Writing Less

Paper • 2502.18600 • Published Feb 25 • 49

Tell me why: Visual foundation models as self-explainable classifiers

Paper • 2502.19577 • Published Feb 26 • 11

SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers

Paper • 2502.20545 • Published Feb 27 • 22

MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing

Paper • 2502.21291 • Published Feb 28 • 5

Predictive Data Selection: The Data That Predicts Is the Data That Teaches

Paper • 2503.00808 • Published Mar 2 • 57

Visual-RFT: Visual Reinforcement Fine-Tuning

Paper • 2503.01785 • Published Mar 3 • 79

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Paper • 2503.01743 • Published Mar 3 • 87

Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions

Paper • 2503.00501 • Published Mar 1 • 12

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Paper • 2402.03300 • Published Feb 5, 2024 • 122

DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles

Paper • 2503.03651 • Published Mar 5 • 16

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Paper • 2503.01342 • Published Mar 3 • 8

From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens

Paper • 2502.18890 • Published Feb 26 • 30

HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs

Paper • 2503.02003 • Published Mar 3 • 48

Process-based Self-Rewarding Language Models

Paper • 2503.03746 • Published Mar 5 • 40

CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs

Paper • 2503.01378 • Published Mar 3 • 3

Token-Efficient Long Video Understanding for Multimodal LLMs

Paper • 2503.04130 • Published Mar 6 • 94

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

Paper • 2503.04724 • Published Mar 6 • 70

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

Paper • 2503.03983 • Published Mar 6 • 24

How to Steer LLM Latents for Hallucination Detection?

Paper • 2503.01917 • Published Mar 1 • 11

The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

Paper • 2503.04606 • Published Mar 6 • 9

Unified Reward Model for Multimodal Understanding and Generation

Paper • 2503.05236 • Published Mar 7 • 124

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

Paper • 2503.05132 • Published Mar 7 • 58

Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching

Paper • 2503.05179 • Published Mar 7 • 46

S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information

Paper • 2503.05085 • Published Mar 7 • 48

R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning

Paper • 2503.05379 • Published Mar 7 • 37

VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

Paper • 2503.05639 • Published Mar 7 • 24

TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

Paper • 2503.05638 • Published Mar 7 • 19

MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning

Paper • 2503.07365 • Published Mar 10 • 62

Automated Movie Generation via Multi-Agent CoT Planning

Paper • 2503.07314 • Published Mar 10 • 45

Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

Paper • 2503.07002 • Published Mar 10 • 40

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Paper • 2503.06749 • Published Mar 9 • 30

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Paper • 2503.07920 • Published Mar 10 • 98

MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice

Paper • 2503.05978 • Published Mar 7 • 35

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Paper • 2503.07536 • Published Mar 10 • 86

Video Action Differencing

Paper • 2503.07860 • Published Mar 10 • 33

UniF^2ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models

Paper • 2503.08120 • Published Mar 11 • 31

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

Paper • 2503.08625 • Published Mar 11 • 26

Implicit Reasoning in Transformers is Reasoning through Shortcuts

Paper • 2503.07604 • Published Mar 10 • 22

LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization

Paper • 2503.08619 • Published Mar 11 • 20

EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Paper • 2503.07027 • Published Mar 10 • 29

LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning

Paper • 2503.04812 • Published Mar 4 • 15

Words or Vision: Do Vision-Language Models Have Blind Faith in Text?

Paper • 2503.02199 • Published Mar 4 • 8

Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

Paper • 2503.07703 • Published Mar 10 • 36

Gemini Embedding: Generalizable Embeddings from Gemini

Paper • 2503.07891 • Published Mar 10 • 39

OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models

Paper • 2503.08686 • Published Mar 11 • 19

CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing

Paper • 2503.06940 • Published Mar 10 • 11

Transformers without Normalization

Paper • 2503.10622 • Published Mar 13 • 164

Charting and Navigating Hugging Face's Model Atlas

Paper • 2503.10633 • Published Mar 13 • 83

World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning

Paper • 2503.10480 • Published Mar 13 • 53

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Paper • 2503.10639 • Published Mar 13 • 50

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Paper • 2503.10291 • Published Mar 13 • 36

4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models

Paper • 2503.10437 • Published Mar 13 • 32

CoRe^2: Collect, Reflect and Refine to Generate Better and Faster

Paper • 2503.09662 • Published Mar 12 • 34

OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting

Paper • 2503.08677 • Published Mar 11 • 29

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

Paper • 2503.10460 • Published Mar 13 • 29

GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

Paper • 2503.10596 • Published Mar 13 • 18

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Paper • 2503.10615 • Published Mar 13 • 17

Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

Paper • 2503.09642 • Published Mar 12 • 18

PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity

Paper • 2503.07677 • Published Mar 10 • 86

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Paper • 2503.11647 • Published Mar 14 • 141

GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving

Paper • 2503.05689 • Published Mar 7 • 3

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Paper • 2503.11576 • Published Mar 14 • 108

Large-scale Pre-training for Grounded Video Caption Generation

Paper • 2503.10781 • Published Mar 13 • 17

ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy

Paper • 2503.06542 • Published Mar 9 • 8

DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation

Paper • 2503.06053 • Published Mar 8 • 138

Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills

Paper • 2503.12533 • Published Mar 16 • 66

DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

Paper • 2503.12885 • Published Mar 17 • 44

Edit Transfer: Learning Image Editing via Vision In-Context Relations

Paper • 2503.13327 • Published Mar 17 • 29

BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing

Paper • 2503.13434 • Published Mar 17 • 27

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Paper • 2503.12937 • Published Mar 17 • 29

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Paper • 2503.12605 • Published Mar 16 • 34

DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding

Paper • 2503.12797 • Published Mar 17 • 30

Aligning Multimodal LLM with Human Preference: A Survey

Paper • 2503.14504 • Published Mar 18 • 25

Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

Paper • 2503.14492 • Published Mar 18 • 18

TULIP: Towards Unified Language-Image Pretraining

Paper • 2503.15485 • Published Mar 19 • 49

φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation

Paper • 2503.13288 • Published Mar 17 • 51

Temporal Regularization Makes Your Video Generator Stronger

Paper • 2503.15417 • Published Mar 19 • 22

VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity

Paper • 2503.11557 • Published Mar 14 • 21

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Paper • 2503.16419 • Published Mar 20 • 74

Unleashing Vecset Diffusion Model for Fast Shape Generation

Paper • 2503.16302 • Published Mar 20 • 44

DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers

Paper • 2503.14487 • Published Mar 18 • 27

JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Paper • 2503.16365 • Published Mar 20 • 41

InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

Paper • 2503.16418 • Published Mar 20 • 35

Ultra-Resolution Adaptation with Ease

Paper • 2503.16322 • Published Mar 20 • 13

M3: 3D-Spatial MultiModal Memory

Paper • 2503.16413 • Published Mar 20 • 15

See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias

Paper • 2503.13834 • Published Mar 18 • 5

Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

Paper • 2503.16057 • Published Mar 20 • 14

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Paper • 2503.14476 • Published Mar 18 • 128

RWKV-7 "Goose" with Expressive Dynamic State Evolution

Paper • 2503.14456 • Published Mar 18 • 149

Impossible Videos

Paper • 2503.14378 • Published Mar 18 • 62

Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

Paper • 2503.16219 • Published Mar 20 • 51

Inside-Out: Hidden Factual Knowledge in LLMs

Paper • 2503.15299 • Published Mar 19 • 55

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Paper • 2503.15558 • Published Mar 18 • 49

Where do Large Vision-Language Models Look at when Answering Questions?

Paper • 2503.13891 • Published Mar 18 • 8

MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving

Paper • 2503.16905 • Published Mar 21 • 54

OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

Paper • 2503.17352 • Published Mar 21 • 23

Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

Paper • 2503.16430 • Published Mar 20 • 35

When Preferences Diverge: Aligning Diffusion Models with Minority-Aware Adaptive DPO

Paper • 2503.16921 • Published Mar 21 • 6

From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration

Paper • 2503.12821 • Published Mar 17 • 9

MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

Paper • 2503.16549 • Published Mar 19 • 15

Why Do Multi-Agent LLM Systems Fail?

Paper • 2503.13657 • Published Mar 17 • 47

When Less is Enough: Adaptive Token Reduction for Efficient Image Representation

Paper • 2503.16660 • Published Mar 20 • 73

Can Large Vision Language Models Read Maps Like a Human?

Paper • 2503.14607 • Published Mar 18 • 9

GAEA: A Geolocation Aware Conversational Model

Paper • 2503.16423 • Published Mar 20 • 6

I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Paper • 2503.18878 • Published Mar 24 • 118

Video-T1: Test-Time Scaling for Video Generation

Paper • 2503.18942 • Published Mar 24 • 88

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Paper • 2503.18892 • Published Mar 24 • 30

Aether: Geometric-Aware Unified World Modeling

Paper • 2503.18945 • Published Mar 24 • 28

Judge Anything: MLLM as a Judge Across Any Modality

Paper • 2503.17489 • Published Mar 21 • 20

Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

Paper • 2503.18013 • Published Mar 23 • 19

Mind with Eyes: from Language Reasoning to Multimodal Reasoning

Paper • 2503.18071 • Published Mar 23 • 3

Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation

Paper • 2503.19622 • Published Mar 25 • 31

CoMP: Continual Multimodal Pre-training for Vision Foundation Models

Paper • 2503.18931 • Published Mar 24 • 30

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

Paper • 2503.19325 • Published Mar 25 • 73

Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation

Paper • 2503.14905 • Published Mar 19 • 20

When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making

Paper • 2503.16965 • Published Mar 21 • 4

LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

Paper • 2503.19990 • Published Mar 25 • 34

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Paper • 2503.19757 • Published Mar 25 • 50

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Paper • 2503.19480 • Published Mar 25 • 16

Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26 • 156

Wan: Open and Advanced Large-Scale Video Generative Models

Paper • 2503.20314 • Published Mar 26 • 51

Open Deep Search: Democratizing Search with Open-source Reasoning Agents

Paper • 2503.20201 • Published Mar 26 • 47

Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

Paper • 2503.20198 • Published Mar 26 • 4

Video-R1: Reinforcing Video Reasoning in MLLMs

Paper • 2503.21776 • Published Mar 27 • 78

UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning

Paper • 2503.21620 • Published Mar 27 • 62

Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Paper • 2503.21696 • Published Mar 27 • 22

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

Paper • 2503.21614 • Published Mar 27 • 39

OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning

Paper • 2503.16081 • Published Mar 20 • 26

Your ViT is Secretly an Image Segmentation Model

Paper • 2503.19108 • Published Mar 24 • 21

On Large Multimodal Models as Open-World Image Classifiers

Paper • 2503.21851 • Published Mar 27 • 5

TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes

Paper • 2503.23461 • Published Mar 30 • 95

Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation

Paper • 2503.24379 • Published Mar 31 • 77

Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Paper • 2503.24376 • Published Mar 31 • 38

Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources

Paper • 2504.00595 • Published Apr 1 • 36

Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?

Paper • 2504.00509 • Published Apr 1 • 22

MoCha: Towards Movie-Grade Talking Character Synthesis

Paper • 2503.23307 • Published Mar 30 • 134

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Paper • 2503.24290 • Published Mar 31 • 62

Unicorn: Text-Only Data Synthesis for Vision Language Model Training

Paper • 2503.22655 • Published Mar 28 • 39

TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization

Paper • 2503.19901 • Published Mar 25 • 41

Expanding RL with Verifiable Rewards Across Diverse Domains

Paper • 2503.23829 • Published Mar 31 • 22

Multi-Token Attention

Paper • 2504.00927 • Published Apr 1 • 51

OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Paper • 2503.22952 • Published Mar 29 • 18

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

Paper • 2504.00557 • Published Apr 1 • 15

Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Paper • 2504.00072 • Published Mar 31 • 7

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Paper • 2504.00999 • Published Apr 1 • 92

Improved Visual-Spatial Reasoning via R1-Zero-Like Training

Paper • 2504.00883 • Published Apr 1 • 64

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Paper • 2504.01724 • Published Apr 2 • 67

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

Paper • 2504.01014 • Published Apr 1 • 70

Towards Physically Plausible Video Generation via VLM Planning

Paper • 2503.23368 • Published Mar 30 • 40

Understanding R1-Zero-Like Training: A Critical Perspective

Paper • 2503.20783 • Published Mar 26 • 50

ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

Paper • 2504.01934 • Published Apr 2 • 23

Articulated Kinematics Distillation from Video Diffusion Models

Paper • 2504.01204 • Published Apr 1 • 24

Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks

Paper • 2504.01308 • Published Apr 2 • 13

DASH: Detection and Assessment of Systematic Hallucinations of VLMs

Paper • 2503.23573 • Published Mar 30 • 12

Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations

Paper • 2503.18817 • Published Mar 24 • 4

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Paper • 2504.02826 • Published Apr 3 • 68

WikiVideo: Article Generation from Multiple Videos

Paper • 2504.00939 • Published Apr 1 • 36

GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

Paper • 2504.02782 • Published Apr 3 • 57

Inference-Time Scaling for Generalist Reward Modeling

Paper • 2504.02495 • Published Apr 3 • 54

Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

Paper • 2504.02587 • Published Apr 3 • 30

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Paper • 2504.00502 • Published Apr 1 • 22

VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning

Paper • 2504.02949 • Published Apr 3 • 21

MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

Paper • 2504.03641 • Published Apr 4 • 14

Slow-Fast Architecture for Video Multi-Modal Large Language Models

Paper • 2504.01328 • Published Apr 2 • 8

URECA: Unique Region Caption Anything

Paper • 2504.05305 • Published Apr 7 • 36

Concept Lancet: Image Editing with Compositional Representation Transplant

Paper • 2504.02828 • Published Apr 3 • 17

LiveVQA: Live Visual Knowledge Seeking

Paper • 2504.05288 • Published Apr 7 • 15

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 188

Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)

Paper • 2504.03151 • Published Apr 4 • 14

Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model

Paper • 2504.05594 • Published Apr 8 • 12

Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

Paper • 2504.05599 • Published Apr 8 • 83

Rethinking Reflection in Pre-Training

Paper • 2504.04022 • Published Apr 5 • 79

Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation

Paper • 2504.03193 • Published Apr 4 • 5

OmniSVG: A Unified Scalable Vector Graphics Generation Model

Paper • 2504.06263 • Published Apr 8 • 167

V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models

Paper • 2504.06148 • Published Apr 8 • 13

OmniCaptioner: One Captioner to Rule Them All

Paper • 2504.07089 • Published Apr 9 • 20

Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting

Paper • 2504.05541 • Published Apr 7 • 16

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Paper • 2504.06958 • Published Apr 9 • 11

Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10 • 125

DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning

Paper • 2504.07128 • Published Apr 2 • 85

VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

Paper • 2504.07956 • Published Apr 10 • 46

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

Paper • 2504.07960 • Published Apr 10 • 48

MM-IFEngine: Towards Multimodal Instruction Following

Paper • 2504.07957 • Published Apr 10 • 34

Scaling Laws for Native Multimodal Models Scaling Laws for Native Multimodal Models

Paper • 2504.07951 • Published Apr 10 • 28

Towards Visual Text Grounding of Multimodal Large Language Model

Paper • 2504.04974 • Published Apr 7 • 16

Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Paper • 2504.08685 • Published Apr 11 • 125

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Paper • 2504.08736 • Published Apr 11 • 47

MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

Paper • 2504.08388 • Published Apr 11 • 40

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Paper • 2504.07615 • Published Apr 10 • 32

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Paper • 2504.08837 • Published Apr 10 • 43

FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

Paper • 2504.09925 • Published Apr 14 • 38

Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability

Paper • 2504.08003 • Published Apr 9 • 49

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14 • 268

Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

Paper • 2504.10068 • Published Apr 14 • 30

TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

Paper • 2504.09641 • Published Apr 13 • 16

VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

Paper • 2504.09130 • Published Apr 12 • 12

Reasoning Models Can Be Effective Without Thinking

Paper • 2504.09858 • Published Apr 14 • 11

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Paper • 2504.10462 • Published Apr 14 • 15

Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

Paper • 2504.10465 • Published Apr 14 • 28

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

Paper • 2504.13169 • Published Apr 17 • 39

VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Paper • 2504.13122 • Published Apr 17 • 21

SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

Paper • 2504.11468 • Published Apr 10 • 28

VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

Paper • 2504.10342 • Published Apr 14 • 11

Multimodal Long Video Modeling Based on Temporal Dynamic Context

Paper • 2504.10443 • Published Apr 14 • 4

Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure

Paper • 2504.10049 • Published Apr 14 • 3

ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

Paper • 2504.10514 • Published Apr 10 • 46

Perception Encoder: The best visual embeddings are not at the output of the network

Paper • 2504.13181 • Published Apr 17 • 34

NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation

Paper • 2504.13055 • Published Apr 17 • 19

DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging

Paper • 2504.12364 • Published Apr 16 • 21

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Paper • 2504.13180 • Published Apr 17 • 17

Could Thinking Multilingually Empower LLM Reasoning?

Paper • 2504.11833 • Published Apr 16 • 28

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Paper • 2504.13837 • Published Apr 18 • 127

UFO2: The Desktop AgentOS

Paper • 2504.14603 • Published Apr 20 • 28

LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark

Paper • 2504.13805 • Published Apr 18 • 12

Vidi: Large Multimodal Models for Video Understanding and Editing

Paper • 2504.15681 • Published Apr 22 • 15

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

Paper • 2504.16030 • Published Apr 22 • 34

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

Paper • 2504.15280 • Published Apr 21 • 23

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Paper • 2504.14239 • Published Apr 19 • 13

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

Paper • 2504.15415 • Published Apr 21 • 22

Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22 • 60

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

Paper • 2504.15271 • Published Apr 21 • 65

AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis

Paper • 2504.13157 • Published Apr 17 • 21

Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation

Paper • 2504.14899 • Published Apr 21 • 21

An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes

Paper • 2504.15270 • Published Apr 21 • 10

BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation

Paper • 2504.14538 • Published Apr 20 • 29

Personalized Text-to-Image Generation with Auto-Regressive Models

Paper • 2504.13162 • Published Apr 17 • 19

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

Paper • 2504.16080 • Published Apr 22 • 15

MR. Video: "MapReduce" is the Principle for Long Video Understanding

Paper • 2504.16082 • Published Apr 22 • 6

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Paper • 2504.15279 • Published Apr 21 • 74

DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning

Paper • 2504.14509 • Published Apr 20 • 51

Trillion 7B Technical Report

Paper • 2504.15431 • Published Apr 21 • 35

I-Con: A Unifying Framework for Representation Learning

Paper • 2504.16929 • Published Apr 23 • 30

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Paper • 2504.16074 • Published Apr 22 • 36

DreamO: A Unified Framework for Image Customization

Paper • 2504.16915 • Published Apr 23 • 25

Progressive Language-guided Visual Learning for Multi-Task Visual Grounding

Paper • 2504.16145 • Published Apr 22 • 2

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

Paper • 2504.17192 • Published Apr 24 • 110

Step1X-Edit: A Practical Framework for General Image Editing

Paper • 2504.17761 • Published Apr 24 • 88

RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

Paper • 2504.17502 • Published Apr 24 • 56

Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

Paper • 2504.17432 • Published Apr 24 • 39

Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

Paper • 2504.17207 • Published Apr 24 • 29

Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Paper • 2504.17789 • Published Apr 24 • 23

DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs

Paper • 2504.17040 • Published Apr 23 • 13

Boosting Generative Image Modeling via Joint Image-Feature Synthesis

Paper • 2504.16064 • Published Apr 22 • 14

TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

Paper • 2504.17343 • Published Apr 24 • 12

ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting

Paper • 2504.15921 • Published Apr 22 • 7

Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

Paper • 2504.16656 • Published Apr 23 • 57

Towards Understanding Camera Motions in Any Video

Paper • 2504.15376 • Published Apr 21 • 157

Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Paper • 2504.16427 • Published Apr 23 • 17

DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

Paper • 2504.12080 • Published Apr 16 • 7

Contrastive Localized Language-Image Pre-Training

Paper • 2410.02746 • Published Oct 3, 2024 • 38

CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

Paper • 2409.19291 • Published Sep 28, 2024 • 20

GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs

Paper • 2410.03645 • Published Oct 4, 2024 • 3

LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

Paper • 2504.19838 • Published Apr 28 • 21

RepText: Rendering Visual Text via Replicating

Paper • 2504.19724 • Published Apr 28 • 30

Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

Paper • 2504.18589 • Published Apr 24 • 11

Clinical knowledge in LLMs does not translate to human interactions

Paper • 2504.18919 • Published Apr 26 • 25

SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

Paper • 2504.19162 • Published Apr 27 • 16

MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

Paper • 2504.16083 • Published Apr 22 • 9

NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Paper • 2504.19854 • Published Apr 28 • 7

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Paper • 2504.20571 • Published Apr 29 • 93

The Leaderboard Illusion

Paper • 2504.20879 • Published Apr 29 • 70

UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

Paper • 2504.20734 • Published Apr 29 • 62

YoChameleon: Personalized Vision and Language Generation

Paper • 2504.20998 • Published Apr 29 • 11

X-Fusion: Introducing New Modality to Frozen Large Language Models

Paper • 2504.20996 • Published Apr 29 • 12

A Review of 3D Object Detection with Vision-Language Models

Paper • 2504.18738 • Published Apr 25 • 2

Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math

Paper • 2504.21233 • Published Apr 30 • 45

100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models

Paper • 2505.00551 • Published May 1 • 37

COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

Paper • 2504.21850 • Published Apr 30 • 26

ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction

Paper • 2504.21855 • Published Apr 30 • 12

A Survey of Interactive Generative Video

Paper • 2504.21853 • Published Apr 30 • 47

T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

Paper • 2505.00703 • Published May 1 • 42

PixelHacker: Image Inpainting with Structural and Semantic Consistency

Paper • 2504.20438 • Published Apr 29 • 43

Improving Editability in Image Generation with Layer-wise Memory

Paper • 2505.01079 • Published May 2 • 28

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Paper • 2505.02707 • Published May 5 • 82

RM-R1: Reward Modeling as Reasoning

Paper • 2505.02387 • Published May 5 • 75

R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

Paper • 2505.02835 • Published May 5 • 25

Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

Paper • 2505.02471 • Published May 5 • 12

SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

Paper • 2505.02370 • Published May 5 • 14

Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

Paper • 2505.01441 • Published Apr 28 • 36

LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

Paper • 2505.02625 • Published May 5 • 22

HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation

Paper • 2504.21650 • Published Apr 30 • 15

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Paper • 2505.03318 • Published about 1 month ago • 92

OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents

Paper • 2505.03570 • Published about 1 month ago • 7

OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution

Paper • 2505.04606 • Published 29 days ago • 7

OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

Paper • 2505.04601 • Published 29 days ago • 26

Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

Paper • 2505.03821 • Published May 3 • 24

HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

Paper • 2505.04512 • Published 29 days ago • 35

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Paper • 2505.04588 • Published 29 days ago • 64

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

Paper • 2505.02567 • Published May 5 • 74

Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

Paper • 2505.02836 • Published May 5 • 7

VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Paper • 2505.03739 • Published about 1 month ago • 8

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Paper • 2505.04921 • Published 29 days ago • 176

On Path to Multimodal Generalist: General-Level and General-Bench

Paper • 2505.04620 • Published 29 days ago • 79

Flow-GRPO: Training Flow Matching Models via Online RL

Paper • 2505.05470 • Published 28 days ago • 77

FG-CLIP: Fine-Grained Visual and Textual Alignment

Paper • 2505.05071 • Published 29 days ago • 17

X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains

Paper • 2505.03981 • Published about 1 month ago • 14

Vision-Language-Action Models: Concepts, Progress, Applications and Challenges

Paper • 2505.04769 • Published 29 days ago • 8

Bielik v3 Small: Technical Report

Paper • 2505.02550 • Published May 5 • 65

Bielik 11B v2 Technical Report

Paper • 2505.02410 • Published May 5 • 54

Seed1.5-VL Technical Report

Paper • 2505.07062 • Published 25 days ago • 143

Unified Continuous Generative Models

Paper • 2505.07447 • Published 25 days ago • 43

DanceGRPO: Unleashing GRPO on Visual Generation

Paper • 2505.07818 • Published 24 days ago • 29

Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

Paper • 2505.07263 • Published 25 days ago • 29

H^{3}DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning

Paper • 2505.07819 • Published 24 days ago • 5

MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills

Paper • 2505.06176 • Published 27 days ago • 11

DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation

Paper • 2505.07233 • Published 25 days ago • 8

MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

Paper • 2505.07916 • Published 25 days ago • 124

Fast Text-to-Audio Generation with Adversarial Post-Training

Paper • 2505.08175 • Published 24 days ago • 22

Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

Paper • 2505.05464 • Published 28 days ago • 10

Aya Vision: Advancing the Frontier of Multilingual Multimodality

Paper • 2505.08751 • Published 23 days ago • 11

SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation

Paper • 2505.08665 • Published 23 days ago • 4

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Paper • 2505.09568 • Published 22 days ago • 90

Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

Paper • 2505.09343 • Published 23 days ago • 63

MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

Paper • 2505.10557 • Published 21 days ago • 46

DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception

Paper • 2505.04410 • Published 30 days ago • 44

WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

Paper • 2505.09558 • Published 22 days ago • 9

Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

Paper • 2505.09439 • Published 23 days ago • 8

VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models

Paper • 2505.08455 • Published 24 days ago • 4

Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA

Paper • 2505.06356 • Published 27 days ago • 3

Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models

Paper • 2505.10554 • Published 21 days ago • 118

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Paper • 2505.08617 • Published 24 days ago • 41

WorldPM: Scaling Human Preference Modeling

Paper • 2505.10527 • Published 21 days ago • 33

End-to-End Vision Tokenizer Tuning

Paper • 2505.10562 • Published 21 days ago • 20

Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

Paper • 2505.10046 • Published 22 days ago • 9

AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection

Paper • 2505.09926 • Published 22 days ago • 6

Qwen3 Technical Report

Paper • 2505.09388 • Published 23 days ago • 181

MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

Paper • 2505.10610 • Published 21 days ago • 53

GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

Paper • 2505.11049 • Published 21 days ago • 59

Visual Planning: Let's Think Only with Images

Paper • 2505.11409 • Published 20 days ago • 55

Simple Semi-supervised Knowledge Distillation from Vision-Language Models via texttt{D}ual-texttt{H}ead texttt{O}ptimization

Paper • 2505.07675 • Published 24 days ago • 19

Chain-of-Model Learning for Language Model

Paper • 2505.11820 • Published 20 days ago • 116

AdaptThink: Reasoning Models Can Learn When to Think

Paper • 2505.13417 • Published 17 days ago • 78

Model Merging in Pre-training of Large Language Models

Paper • 2505.12082 • Published 19 days ago • 35

Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images

Paper • 2505.07704 • Published 24 days ago • 29

Faster Video Diffusion with Trainable Sparse Attention

Paper • 2505.13389 • Published 17 days ago • 35

ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

Paper • 2505.13180 • Published 18 days ago • 13

VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning

Paper • 2505.12081 • Published 19 days ago • 17

R3: Robust Rubric-Agnostic Reward Models

Paper • 2505.13388 • Published 17 days ago • 11

Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

Paper • 2505.13181 • Published 18 days ago • 9

A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Paper • 2505.12781 • Published 18 days ago • 2

Emerging Properties in Unified Multimodal Pretraining

Paper • 2505.14683 • Published 16 days ago • 129

Reward Reasoning Model

Paper • 2505.14674 • Published 16 days ago • 34

Visual Agentic Reinforcement Fine-Tuning

Paper • 2505.14246 • Published 17 days ago • 31

VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank

Paper • 2505.14460 • Published 16 days ago • 30

Think Only When You Need with Large Hybrid-Reasoning Models

Paper • 2505.14631 • Published 16 days ago • 19

Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

Paper • 2505.14677 • Published 16 days ago • 15

Hunyuan-Game: Industrial-grade Intelligent Game Creation Model

Paper • 2505.14135 • Published 17 days ago • 14

VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation

Paper • 2505.14640 • Published 16 days ago • 14

Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training

Paper • 2505.14681 • Published 16 days ago • 9

Visual Instruction Bottleneck Tuning

Paper • 2505.13946 • Published 17 days ago • 10

Not All Correct Answers Are Equal: Why Your Distillation Source Matters

Paper • 2505.14464 • Published 16 days ago • 8

Lessons from Defending Gemini Against Indirect Prompt Injections

Paper • 2505.14534 • Published 16 days ago • 8

The Hallucination Tax of Reinforcement Finetuning

Paper • 2505.13988 • Published 17 days ago • 8

Incorporating brain-inspired mechanisms for multimodal learning in artificial intelligence

Paper • 2505.10176 • Published 22 days ago • 3

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Paper • 2505.15277 • Published 16 days ago • 98

MMaDA: Multimodal Large Diffusion Language Models

Paper • 2505.15809 • Published 15 days ago • 85

UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning

Paper • 2505.14231 • Published 17 days ago • 51

Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

Paper • 2505.15045 • Published 16 days ago • 53

Vid2World: Crafting Video Diffusion Models to Interactive World Models

Paper • 2505.14357 • Published 17 days ago • 25

When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning

Paper • 2505.15400 • Published 16 days ago • 22

lmgame-Bench: How Good are LLMs at Playing Games?

Paper • 2505.15146 • Published 16 days ago • 20

IA-T2I: Internet-Augmented Text-to-Image Generation

Paper • 2505.15779 • Published 15 days ago • 15

Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs

Paper • 2505.15210 • Published 16 days ago • 15

RLVR-World: Training World Models with Reinforcement Learning

Paper • 2505.13934 • Published 17 days ago • 14

ConvSearch-R1: Enhancing Query Reformulation for Conversational Search with Reasoning via Reinforcement Learning

Paper • 2505.15776 • Published 15 days ago • 10

HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

Paper • 2505.11454 • Published 20 days ago • 5

QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

Paper • 2505.16175 • Published 15 days ago • 39

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Paper • 2505.16933 • Published 14 days ago • 30

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Paper • 2505.17022 • Published 14 days ago • 26

Risk-Averse Reinforcement Learning with Itakura-Saito Loss

Paper • 2505.16925 • Published 14 days ago • 24

Understanding Generative AI Capabilities in Everyday Image Editing Tasks

Paper • 2505.16181 • Published 15 days ago • 23

Training-Free Efficient Video Generation via Dynamic Token Carving

Paper • 2505.16864 • Published 14 days ago • 21

Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

Paper • 2505.16990 • Published 14 days ago • 20

VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance

Paper • 2505.15952 • Published 15 days ago • 19

SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

Paper • 2505.17018 • Published 14 days ago • 14

Backdoor Cleaning without External Guidance in MLLM Fine-tuning

Paper • 2505.16916 • Published 14 days ago • 16

WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

Paper • 2505.16421 • Published 15 days ago • 18

LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Paper • 2505.16839 • Published 14 days ago • 12

GRIT: Teaching MLLMs to Think with Images

Paper • 2505.15879 • Published 15 days ago • 12

Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Paper • 2505.16854 • Published 14 days ago • 11

OViP: Online Vision-Language Preference Learning

Paper • 2505.15963 • Published 15 days ago • 9

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Paper • 2505.17015 • Published 14 days ago • 8

VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought

Paper • 2505.16192 • Published 15 days ago • 8

Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets

Paper • 2505.15517 • Published 16 days ago • 4

How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads

Paper • 2505.15865 • Published 16 days ago • 4

RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

Paper • 2505.14462 • Published 16 days ago • 4

One RL to See Them All: Visual Triple Unified Reinforcement Learning

Paper • 2505.18129 • Published 13 days ago • 59

Teaching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection

Paper • 2505.17558 • Published 14 days ago • 15

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

Paper • 2505.16211 • Published 15 days ago • 18

Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study

Paper • 2505.15389 • Published 16 days ago • 8

Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models

Paper • 2505.18536 • Published 13 days ago • 18

QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning

Paper • 2505.17667 • Published 14 days ago • 84

Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models

Paper • 2505.17225 • Published 14 days ago • 64

QwenLong-CPRS: Towards infty-LLMs with Dynamic Context Optimization

Paper • 2505.18092 • Published 13 days ago • 41

RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs

Paper • 2505.16770 • Published 14 days ago • 10

Interactive Post-Training for Vision-Language-Action Models

Paper • 2505.17016 • Published 14 days ago • 6

G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

Paper • 2505.13426 • Published 17 days ago • 12

Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision

Paper • 2505.19706 • Published 11 days ago • 3

RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

Paper • 2505.17540 • Published 14 days ago • 7

Shifting AI Efficiency From Model-Centric to Data-Centric Compression

Paper • 2505.19147 • Published 12 days ago • 143

Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps

Paper • 2505.18675 • Published 13 days ago • 23

Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

Paper • 2505.20256 • Published 10 days ago • 17

REARANK: Reasoning Re-ranking Agent via Reinforcement Learning

Paper • 2505.20046 • Published 11 days ago • 17

Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

Paper • 2505.20152 • Published 10 days ago • 11

Interleaved Reasoning for Large Language Models via Reinforcement Learning

Paper • 2505.19640 • Published 11 days ago • 12

InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

Paper • 2505.10887 • Published 21 days ago • 10

STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs

Paper • 2505.15804 • Published 15 days ago • 8

Jodi: Unification of Visual Generation and Understanding via Joint Modeling

Paper • 2505.19084 • Published 12 days ago • 20

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Paper • 2505.15957 • Published 15 days ago • 2

Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models

Paper • 2505.20236 • Published 10 days ago • 2

Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models

Paper • 2505.14071 • Published 17 days ago • 1

Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Paper • 2505.21497 • Published 9 days ago • 91

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Paper • 2505.21374 • Published 9 days ago • 27

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

Paper • 2505.21333 • Published 9 days ago • 39

MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

Paper • 2505.21327 • Published 9 days ago • 81

MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks

Paper • 2505.16459 • Published 15 days ago • 45

SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

Paper • 2505.19099 • Published 12 days ago • 8

Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Paper • 2505.21457 • Published 9 days ago • 14

UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

Paper • 2505.21496 • Published 9 days ago • 38

ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

Paper • 2505.21500 • Published 9 days ago • 11

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Paper • 2505.19897 • Published 11 days ago • 101

MLLMs are Deeply Affected by Modality Bias

Paper • 2505.18657 • Published 13 days ago • 5

Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

Paper • 2505.22453 • Published 8 days ago • 45

Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Paper • 2505.22334 • Published 9 days ago • 36

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Paper • 2505.22617 • Published 8 days ago • 114

R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

Paper • 2505.21600 • Published 9 days ago • 68

Skywork Open Reasoner 1 Technical Report

Paper • 2505.22312 • Published 9 days ago • 52

Sherlock: Self-Correcting Reasoning in Vision-Language Models

Paper • 2505.22651 • Published 8 days ago • 50

Fostering Video Reasoning via Next-Event Prediction

Paper • 2505.22457 • Published 8 days ago • 27

VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

Paper • 2505.22019 • Published 9 days ago • 10

RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

Paper • 2505.22613 • Published 8 days ago • 7

Zero-Shot Vision Encoder Grafting via LLM Surrogates

Paper • 2505.22664 • Published 8 days ago • 5

MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding

Paper • 2505.20298 • Published 10 days ago • 6

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Paper • 2505.23747 • Published 7 days ago • 66

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Paper • 2505.23762 • Published 7 days ago • 45

The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

Paper • 2505.22653 • Published 8 days ago • 64

VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

Paper • 2505.23693 • Published 7 days ago • 56

VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

Paper • 2505.23359 • Published 8 days ago • 39

To Trust Or Not To Trust Your Vision-Language Model's Prediction

Paper • 2505.23745 • Published 7 days ago • 5

Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

Paper • 2505.22943 • Published 8 days ago • 4

FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding

Paper • 2505.17330 • Published 14 days ago • 22

HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

Paper • 2505.20444 • Published 10 days ago • 3

cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning

Paper • 2505.22914 • Published 8 days ago • 28

Are Reasoning Models More Prone to Hallucination?

Paper • 2505.23646 • Published 7 days ago • 24

Multi-Domain Explainability of Preferences

Paper • 2505.20088 • Published 10 days ago • 21

REOrdering Patches Improves Vision Models

Paper • 2505.23751 • Published 7 days ago • 16

Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Paper • 2505.22918 • Published 8 days ago • 7

Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Paper • 2505.23759 • Published 7 days ago • 5

A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models

Paper • 2505.19286 • Published 11 days ago • 4

Grounded Reinforcement Learning for Visual Reasoning

Paper • 2505.23678 • Published 7 days ago • 3

Time Blindness: Why Video-Language Models Can't See What Humans Can?

Paper • 2505.24867 • Published 6 days ago • 72

AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

Paper • 2505.24863 • Published 6 days ago • 84

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Paper • 2505.24864 • Published 6 days ago • 112

Large Language Models for Data Synthesis

Paper • 2505.14752 • Published 17 days ago • 47

Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

Paper • 2505.18842 • Published 12 days ago • 35

ViStoryBench: Comprehensive Benchmark Suite for Story Visualization

Paper • 2505.24862 • Published 6 days ago • 30

DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models

Paper • 2505.24025 • Published 7 days ago • 23

MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning

Paper • 2505.24871 • Published 6 days ago • 18

Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

Paper • 2505.24878 • Published 6 days ago • 21

Vision Language Models are Biased

Paper • 2505.23941 • Published 7 days ago • 17

More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Paper • 2505.21523 • Published 14 days ago • 14

Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models

Paper • 2505.20873 • Published 10 days ago • 10

ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL

Paper • 2505.24875 • Published 6 days ago • 10

un^2CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Paper • 2505.24517 • Published 7 days ago • 5

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Paper • 2506.01844 • Published 3 days ago • 68

Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models

Paper • 2506.00996 • Published 5 days ago • 34

Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

Paper • 2505.23590 • Published 7 days ago • 24

LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks

Paper • 2506.00411 • Published 6 days ago • 27

SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Paper • 2506.01713 • Published 4 days ago • 30

EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models

Paper • 2506.01667 • Published 4 days ago • 20

VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL

Paper • 2505.23977 • Published 7 days ago • 9

Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Paper • 2505.24625 • Published 7 days ago • 8

OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

Paper • 2505.21724 • Published 9 days ago • 4

Aligning VLM Assistants with Personalized Situated Cognition

Paper • 2506.00930 • Published 5 days ago • 2

MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs

Paper • 2506.01674 • Published 4 days ago • 23

SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis

Paper • 2506.02096 • Published 3 days ago • 49

VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments

Paper • 2506.02387 • Published 3 days ago • 55

UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Paper • 2506.03147 • Published 2 days ago • 55

CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs

Paper • 2505.24120 • Published 7 days ago • 46

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Paper • 2506.03135 • Published 2 days ago • 33

Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

Paper • 2506.00123 • Published 6 days ago • 31

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Paper • 2506.03143 • Published 2 days ago • 34

FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens

Paper • 2506.03096 • Published 2 days ago • 3

TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning for Enhancing LLMs' Social Intelligence

Paper • 2505.24500 • Published 7 days ago • 11