arxiv:2509.26278

ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

Published on Sep 30
· Submitted by Edoardo Bianchi on Oct 1

Abstract

Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features, projected from a frozen TimeSformer backbone into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.

AI-generated summary

ProfVLM, a compact vision-language model, uses generative reasoning to estimate skill proficiency and generate expert feedback from multi-view videos, outperforming existing methods with fewer parameters and faster training.
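To make the fusion step more concrete, here is a minimal, hypothetical PyTorch sketch of an attentive gated projector over per-view features from a frozen video backbone. The layer choices, dimensions, and gating form are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AttentiveGatedProjector(nn.Module):
    """Hypothetical sketch: fuse per-view video features and project them
    into the language model's embedding space. Dimensions and layers are
    assumptions for illustration, not the paper's exact design."""

    def __init__(self, vid_dim=768, lm_dim=2048, n_heads=8):
        super().__init__()
        # Cross-view attention lets each view attend to the others.
        self.attn = nn.MultiheadAttention(vid_dim, n_heads, batch_first=True)
        # A learned gate decides how much of the attended signal to mix back in.
        self.gate = nn.Sequential(nn.Linear(2 * vid_dim, vid_dim), nn.Sigmoid())
        # Linear projection into the language model's token-embedding space.
        self.proj = nn.Linear(vid_dim, lm_dim)

    def forward(self, view_feats):
        # view_feats: (batch, n_views, vid_dim), e.g. ego + exo features
        attended, _ = self.attn(view_feats, view_feats, view_feats)
        g = self.gate(torch.cat([view_feats, attended], dim=-1))
        fused = g * attended + (1 - g) * view_feats   # gated residual fusion
        return self.proj(fused)                       # (batch, n_views, lm_dim)


# Example: two views (egocentric + exocentric) from a frozen video backbone
feats = torch.randn(4, 2, 768)                        # placeholder features
visual_tokens = AttentiveGatedProjector()(feats)      # tokens fed to the LM
```

Under this reading, the gate interpolates per view between raw and cross-view-attended features before projection, so each view can borrow context from the others without discarding its own signal.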

Community

Paper submitter

We present ProfVLM, a vision-language model tailored for proficiency estimation, leveraging multi-view video encoders and a lightweight backbone language model. ProfVLM generates both a proficiency label and natural language feedback, bridging accurate assessment and practical usability.

Here are the paper's highlights:

• Novel lightweight vision-language model architecture for multi-domain action quality assessment and proficiency estimation

• State-of-the-art performance achieved with 20x fewer parameters, 2x fewer frames, and 60% reduction in training time compared to existing baselines

• Efficient multi-view and multi-modal fusion architecture that generalizes across diverse domains

• Unified framework combining structured proficiency classification with open-ended conditional generation of expert-like feedback (see the parsing sketch after this list)
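To illustrate how a single generative output can serve both the structured and the open-ended side, a small, hypothetical post-processing step could recover the proficiency label from the generated text. The output template and label names below are assumptions, not the paper's specification.

```python
import re

# Hypothetical post-processing for the unified output: the model is assumed to
# emit text such as "Proficiency: Intermediate Expert. Feedback: ..."
# Both the template and the label set are illustrative assumptions.
LABELS = ["Novice", "Early Expert", "Intermediate Expert", "Late Expert"]

def parse_output(text: str):
    """Split a generated response into a proficiency label and free-form feedback."""
    label = next((l for l in LABELS if l.lower() in text.lower()), None)
    match = re.search(r"feedback\s*:\s*(.+)", text, flags=re.IGNORECASE | re.DOTALL)
    feedback = match.group(1).strip() if match else text.strip()
    return label, feedback

label, feedback = parse_output(
    "Proficiency: Intermediate Expert. Feedback: Bend your knees more on landing."
)
# label    -> "Intermediate Expert"
# feedback -> "Bend your knees more on landing."
```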
