LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs Paper • 2506.21862 • Published 17 days ago • 35
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model Paper • 2504.07615 • Published Apr 10 • 32
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning Paper • 2504.07960 • Published Apr 10 • 49
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction Paper • 2501.03218 • Published Jan 6 • 37
Apollo: An Exploration of Video Understanding in Large Multimodal Models Paper • 2412.10360 • Published Dec 13, 2024 • 146
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions Paper • 2412.09596 • Published Dec 12, 2024 • 99
CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation Paper • 2306.04300 • Published Jun 7, 2023 • 2
VideoLLM-online: Online Video Large Language Model for Streaming Video Paper • 2406.11816 • Published Jun 17, 2024 • 25
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published Dec 6, 2024 • 160
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper • 2412.04467 • Published Dec 5, 2024 • 117
MotionCLR: Motion Generation and Training-free Editing via Understanding Attention Mechanisms Paper • 2410.18977 • Published Oct 24, 2024 • 15