GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning Paper • 2507.01006 • Published 6 days ago • 171
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs Paper • 2506.21862 • Published 11 days ago • 34
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context Paper • 2506.21277 • Published 11 days ago • 14
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding Paper • 2501.15111 • Published Jan 25 • 1
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding Paper • 2501.05067 • Published Jan 9 • 1
Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness Paper • 2501.07978 • Published Jan 14