ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Abstract
ProfVLM, a compact vision-language model, uses generative reasoning to estimate skill proficiency and generate expert feedback from multi-view videos, outperforming existing methods with fewer parameters and faster training.
Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features from a frozen TimeSformer backbone and projects them into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
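The abstract does not spell out the projector's internals, so the snippet below is only a minimal PyTorch sketch of what an attentive, gated multi-view projector feeding a language model could look like. The layer sizes, the choice of the first view as the attention query, and the gating formula are all assumptions made for illustration, not the paper's exact design.

```python
# Hypothetical sketch of an attentive gated multi-view projector.
# All dimensions and fusion details below are assumptions for illustration.
import torch
import torch.nn as nn

class AttentiveGatedProjector(nn.Module):
    def __init__(self, vis_dim: int = 768, lm_dim: int = 2048, num_heads: int = 8):
        super().__init__()
        # Cross-view attention: query tokens attend over tokens from all views.
        self.view_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        # Gate decides how much of the attended multi-view context to keep.
        self.gate = nn.Sequential(nn.Linear(2 * vis_dim, vis_dim), nn.Sigmoid())
        # Linear projection into the language model's embedding space.
        self.proj = nn.Linear(vis_dim, lm_dim)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_views, num_tokens, vis_dim) from a frozen video backbone.
        b, v, t, d = views.shape
        ego = views[:, 0]                       # assume view 0 is the query view
        ctx = views.reshape(b, v * t, d)        # all views flattened as keys/values
        attended, _ = self.view_attn(ego, ctx, ctx)
        g = self.gate(torch.cat([ego, attended], dim=-1))
        fused = g * attended + (1.0 - g) * ego  # gated residual fusion
        return self.proj(fused)                 # (batch, num_tokens, lm_dim)

# Example: 2 clips, 4 views, 8 temporal tokens each, projected to a 2048-d LM space.
tokens = AttentiveGatedProjector()(torch.randn(2, 4, 8, 768))
print(tokens.shape)  # torch.Size([2, 8, 2048])
```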
Community
ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
We present ProfVLM, a vision-language model tailored for proficiency estimation, leveraging multi-view video encoders and a lightweight backbone language model. ProfVLM generates both a proficiency label and natural language feedback, bridging accurate assessment and practical usability.
Here are the paper's highlights:
• Novel lightweight vision-language model architecture for multi-domain action quality assessment and proficiency estimation
• State-of-the-art performance achieved with 20x fewer parameters, 2x fewer frames, and 60% reduction in training time compared to existing baselines
• Efficient multi-view and multi-modal fusion architecture that generalizes across diverse domains
• Unified framework combining structured proficiency classification with open-ended conditional generation of expert-like feedback
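As a rough illustration of how a single generative output can serve both purposes, the sketch below assumes a hypothetical output format in which the model emits the proficiency label and the critique in one text sequence that is then parsed; the actual prompt and target format used by ProfVLM may differ.

```python
# Minimal sketch of parsing a combined label-plus-feedback generation.
# The "Proficiency: ... Feedback: ..." template is an assumption, not the paper's format.
import re

def parse_model_output(text: str) -> tuple[str, str]:
    """Split a generated string into a structured label and free-form feedback."""
    match = re.search(r"Proficiency:\s*(\w+)\.?\s*Feedback:\s*(.*)", text, re.S)
    if not match:
        return "unknown", text.strip()
    return match.group(1).lower(), match.group(2).strip()

label, feedback = parse_model_output(
    "Proficiency: Intermediate. Feedback: Keep your elbows tucked during the lift."
)
print(label)     # intermediate
print(feedback)  # Keep your elbows tucked during the lift.
```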
The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos (2025)
- Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization (2025)
- Multi-Level LVLM Guidance for Untrimmed Video Action Recognition (2025)
- FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding (2025)
- VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments (2025)
- Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping (2025)
- A Survey on Video Temporal Grounding with Multimodal Large Language Model (2025)