Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Abstract
A zero-finetuning framework uses Multimodal Large Language Models to inject high-level semantics into video recommendations, improving intent-awareness over traditional methods.
Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30-second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendations yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommender-system-agnostic, zero-finetuning framework that injects high-level semantics into the recommendation pipeline by prompting an off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural-language description (e.g. "a superhero parody with slapstick fights and orchestral stabs"), bridging the gap between raw content and user intent. We encode the MLLM output with a state-of-the-art text encoder and feed the resulting embeddings into standard collaborative, content-based, and generative recommenders. On the MicroLens-100K dataset, which emulates user interactions with TikTok-style videos, our framework consistently surpasses conventional video, audio, and metadata features across five representative models. Our findings highlight the promise of leveraging MLLMs as on-the-fly knowledge extractors to build more intent-aware video recommenders.
Community
I'm excited to share our latest paper, accepted at RecSys 2025!
"Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations" — a project that gave me the chance to connect my Computer Vision and Recommender Systems experience.
📌 The challenge: Video recommendation is hard because we rarely know what makes a video interesting to a user. Video encoders produce features that capture "someone is dancing on a rooftop", but they are blind to the cultural context that makes the video so cool (e.g. the dance parodies a 1990s superhero trope).
🧠 Our solution: We use Multimodal Large Language Models (MLLMs) to generate rich descriptions of each video, its characters, and so on. Then we plug everything into standard recommendation models through a lightweight text encoder (a minimal sketch is below).
✅ Key takeaway:
- pixels show what happens on-screen
- titles reflect what the uploader hopes will attract clicks
- but MLLM-generated text captures why viewers might care
- the result: up to 60% performance gains.
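To make the pipeline concrete, here is a minimal sketch of the describe-then-encode idea. It is illustrative, not the code from the paper: `mllm_describe` is a placeholder for whichever off-the-shelf MLLM you prompt, the MiniLM encoder stands in for the text encoder we mention, and the cosine-similarity ranker is just a toy content-based recommender.

```python
# Minimal sketch (assumptions: sentence-transformers installed; `mllm_describe`
# is a stand-in for prompting any off-the-shelf MLLM on a video clip).
import numpy as np
from sentence_transformers import SentenceTransformer


def mllm_describe(clip_path: str) -> str:
    """Placeholder: prompt an MLLM with something like
    'Describe this clip: what happens, its intent, humour, and cultural references.'"""
    raise NotImplementedError("plug in your MLLM client here")


# Illustrative text encoder; the paper does not prescribe this specific model.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")


def item_embeddings(clip_paths: list[str]) -> np.ndarray:
    """MLLM description -> text embedding for every clip."""
    descriptions = [mllm_describe(p) for p in clip_paths]
    # L2-normalised so the dot product below equals cosine similarity.
    return text_encoder.encode(descriptions, normalize_embeddings=True)


def recommend(watched_idx: list[int], items: np.ndarray, k: int = 10) -> np.ndarray:
    """Toy content-based ranker: score every clip against the mean
    embedding of the clips the user already watched."""
    profile = items[watched_idx].mean(axis=0)
    scores = items @ profile
    scores[watched_idx] = -np.inf  # don't re-recommend watched clips
    return np.argsort(-scores)[:k]
```

The same description embeddings can also serve as item-side features for the collaborative, content-based, and generative models we evaluate in the paper.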
Hi Marco,
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Engagement Prediction of Short Videos with Large Multimodal Models (2025)
- Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation (2025)
- GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval (2025)
- Just Ask for Music (JAM): Multimodal and Personalized Natural Language Music Recommendation (2025)
- Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment (2025)
- Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation (2025)
- CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions (2025)