VLRM: Vision-Language Models act as Reward Models for Image Captioning Paper • 2404.01911 • Published Apr 2, 2024
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization Paper • 2203.07086 • Published Mar 14, 2022