ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Abstract
ProfVLM, a compact vision-language model, uses generative reasoning to estimate skill proficiency and generate expert feedback from multi-view videos, outperforming existing methods with fewer parameters and faster training.
Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features from a frozen TimeSformer backbone and projects them into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
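The abstract does not spell out the projector's internals, so the snippet below is only a minimal PyTorch sketch of what an attentive, gated multi-view projector feeding a language model could look like. The layer sizes, the choice of the first view as the attention query, and the gating formula are all assumptions made for illustration, not the paper's exact design.

```python
# Hypothetical sketch of an attentive gated multi-view projector.
# All dimensions and fusion details below are assumptions for illustration.
import torch
import torch.nn as nn

class AttentiveGatedProjector(nn.Module):
    def __init__(self, vis_dim: int = 768, lm_dim: int = 2048, num_heads: int = 8):
        super().__init__()
        # Cross-view attention: query tokens attend over tokens from all views.
        self.view_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        # Gate decides how much of the attended multi-view context to keep.
        self.gate = nn.Sequential(nn.Linear(2 * vis_dim, vis_dim), nn.Sigmoid())
        # Linear projection into the language model's embedding space.
        self.proj = nn.Linear(vis_dim, lm_dim)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_views, num_tokens, vis_dim) from a frozen video backbone.
        b, v, t, d = views.shape
        ego = views[:, 0]                       # assume view 0 is the query view
        ctx = views.reshape(b, v * t, d)        # all views flattened as keys/values
        attended, _ = self.view_attn(ego, ctx, ctx)
        g = self.gate(torch.cat([ego, attended], dim=-1))
        fused = g * attended + (1.0 - g) * ego  # gated residual fusion
        return self.proj(fused)                 # (batch, num_tokens, lm_dim)

# Example: 2 clips, 4 views, 8 temporal tokens each, projected to a 2048-d LM space.
tokens = AttentiveGatedProjector()(torch.randn(2, 4, 8, 768))
print(tokens.shape)  # torch.Size([2, 8, 2048])
```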
Community
ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
We present ProfVLM, a vision-language model tailored for proficiency estimation, leveraging multi-view video encoders and a lightweight backbone language model. ProfVLM generates both a proficiency label and natural language feedback, bridging accurate assessment and practical usability.
Here are the paper's highlights:
• Novel lightweight vision-language model architecture for multi-domain action quality assessment and proficiency estimation
• State-of-the-art performance achieved with 20x fewer parameters, 2x fewer frames, and 60% reduction in training time compared to existing baselines
• Efficient multi-view and multi-modal fusion architecture that generalizes across diverse domains
• Unified framework combining structured proficiency classification with open-ended conditional generation of expert-like feedback
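As a rough illustration of how a single generative output can serve both purposes, the sketch below assumes a hypothetical output format in which the model emits the proficiency label and the critique in one text sequence that is then parsed; the actual prompt and target format used by ProfVLM may differ.

```python
# Minimal sketch of parsing a combined label-plus-feedback generation.
# The "Proficiency: ... Feedback: ..." template is an assumption, not the paper's format.
import re

def parse_model_output(text: str) -> tuple[str, str]:
    """Split a generated string into a structured label and free-form feedback."""
    match = re.search(r"Proficiency:\s*(\w+)\.?\s*Feedback:\s*(.*)", text, re.S)
    if not match:
        return "unknown", text.strip()
    return match.group(1).lower(), match.group(2).strip()

label, feedback = parse_model_output(
    "Proficiency: Intermediate. Feedback: Keep your elbows tucked during the lift."
)
print(label)     # intermediate
print(feedback)  # Keep your elbows tucked during the lift.
```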
The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos (2025)
- Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization (2025)
- Multi-Level LVLM Guidance for Untrimmed Video Action Recognition (2025)
- FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding (2025)
- VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments (2025)
- Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping (2025)
- A Survey on Video Temporal Grounding with Multimodal Large Language Model (2025)