Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback
Abstract
Three methods for multi-view proficiency estimation—SkillFormer, PATS, and ProfVLM—achieve state-of-the-art accuracy on Ego-Exo4D with reduced parameters and training epochs while enabling interpretable feedback generation.
Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identification. This task is challenging because proficiency is encoded in subtle differences in timing, balance, body mechanics, and execution, often distributed across multiple views and short temporal events. We discuss three recent contributions to multi-view proficiency estimation on Ego-Exo4D. SkillFormer introduces a parameter-efficient discriminative architecture for selective multi-view fusion; PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements; and ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback through a gated cross-view projector and a compact language backbone. Together, these methods achieve state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while moving from closed-set classification toward interpretable feedback generation. These results highlight a shift toward efficient, multi-view systems that combine selective fusion, proficiency-aware sampling, and actionable generative feedback.
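To make the three ideas concrete, the sketches below illustrate them in PyTorch. First, selective multi-view fusion in the spirit of SkillFormer: a learned query attends over per-view clip embeddings, so informative views dominate the fused representation. This is a minimal sketch assuming frozen per-view backbone features; the class and variable names are illustrative, not SkillFormer's released code.

```python
import torch
import torch.nn as nn

class SelectiveViewFusion(nn.Module):
    """Fuse per-view clip embeddings with a learned attention query.

    The attention weights act as a soft view selector: views carrying
    more proficiency-relevant signal receive larger weights.
    """
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned fusion query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, dim), one embedding per camera.
        q = self.query.expand(view_feats.size(0), -1, -1)
        fused, weights = self.attn(q, view_feats, view_feats)
        # weights: (batch, 1, num_views) -- inspectable soft view selection.
        return self.norm(fused.squeeze(1))

# Example: fuse 1 egocentric + 3 exocentric views into one 512-d vector.
feats = torch.randn(2, 4, 512)          # assumed frozen-backbone features
fused = SelectiveViewFusion(dim=512)(feats)
print(fused.shape)                      # torch.Size([2, 512])
```

A lightweight classification head on top of `fused` would then predict the proficiency label; keeping the video backbone frozen is what keeps the trainable-parameter count low.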
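Second, proficiency-aware temporal sampling in the spirit of PATS: rather than spacing frames uniformly across the whole clip, which destroys fine movement dynamics, a few excerpt windows are chosen and sampled densely with consecutive frames inside each. The function below is a hedged sketch; the window count, stride, and chunk-based placement are assumptions, not the paper's exact procedure.

```python
import numpy as np

def dense_excerpt_indices(num_frames: int, num_excerpts: int = 4,
                          frames_per_excerpt: int = 8, stride: int = 1,
                          rng=None) -> np.ndarray:
    """Return frame indices covering `num_excerpts` locally dense windows."""
    if rng is None:
        rng = np.random.default_rng()
    span = (frames_per_excerpt - 1) * stride + 1
    # Split the video into equal chunks and draw one window start per chunk,
    # which spreads the excerpts over the full action.
    bounds = np.linspace(0, max(num_frames - span, 0), num_excerpts + 1)
    indices = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        start = int(rng.integers(int(lo), int(hi) + 1))
        indices.append(start + stride * np.arange(frames_per_excerpt))
    return np.clip(np.concatenate(indices), 0, num_frames - 1)

# Example: 4 dense 8-frame excerpts from a 300-frame video (32 frames total).
print(dense_excerpt_indices(300))
```

Dense consecutive frames preserve the short temporal events that, as noted above, often encode proficiency, while spreading the windows across chunks keeps coverage of the full action.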
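Third, a gated cross-view projector in the spirit of ProfVLM: per-view video tokens are projected into the language model's embedding space, and a learned gate scales each view's contribution before the tokens are handed to the generative decoder. Again a minimal sketch under assumed shapes and names, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GatedCrossViewProjector(nn.Module):
    """Project per-view video tokens into LM space with per-view gates."""
    def __init__(self, vid_dim: int, lm_dim: int, num_views: int):
        super().__init__()
        self.proj = nn.Linear(vid_dim, lm_dim)
        # One scalar gate logit per view, learned end to end.
        self.gate_logits = nn.Parameter(torch.zeros(num_views))

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, tokens_per_view, vid_dim)
        gates = torch.sigmoid(self.gate_logits)        # (num_views,)
        tokens = self.proj(view_feats)                 # map into LM space
        tokens = tokens * gates.view(1, -1, 1, 1)      # down-weight weak views
        b, v, t, d = tokens.shape
        return tokens.reshape(b, v * t, d)             # token sequence for the LM

# Example: 4 views, 16 tokens each, projected from 768-d to a 1024-d LM space.
proj = GatedCrossViewProjector(vid_dim=768, lm_dim=1024, num_views=4)
out = proj(torch.randn(2, 4, 16, 768))
print(out.shape)                                       # torch.Size([2, 64, 1024])
```

The resulting token sequence would be concatenated with embedded prompt text and fed to a compact causal language model, which decodes both the proficiency label and free-form expert-style feedback.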
Community
Advances in Action Quality Assessment
Similar papers recommended by the Semantic Scholar API:
- ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos (2026)
- TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition (2026)
- Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition (2026)
- PriorNet: Prior-Guided Engagement Estimation from Face Video (2026)
- B-MoE: A Body-Part-Aware Mixture-of-Experts "All Parts Matter" Approach to Micro-Action Recognition (2026)
- ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model (2026)
- TemporalDoRA: Temporal PEFT for Robust Surgical Video Question Answering (2026)