- Caption Anything: Interactive Image Description with Diverse Multimodal Controls — arXiv:2305.02677, published May 4, 2023
- Video Understanding with Large Language Models: A Survey — arXiv:2312.17432, published Dec 29, 2023
- Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models — arXiv:2307.14061, published Jul 26, 2023
- Transferable Decoding with Visual Entities for Zero-Shot Image Captioning — arXiv:2307.16525, published Jul 31, 2023
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models — arXiv:2308.11186, published Aug 22, 2023
- LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos — arXiv:2411.19772, published Nov 29, 2024
- GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers — arXiv:2503.19480, published Mar 25, 2025
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation — arXiv:2505.05422, published May 8, 2025
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? — arXiv:2505.21374, published May 27, 2025