Abstract
A new fine-tuning strategy, STAT, uses a teacher model's metacognition to identify and address skill gaps in a student model, leading to improved performance on both in-distribution and out-of-distribution benchmarks.
Language models often show little to no improvement (i.e., "saturation") when trained via vanilla supervised fine-tuning (SFT) on data similar to what they saw in their training set (e.g., MATH). We introduce a new fine-tuning strategy, STAT, to train such a student model by using the metacognition ability of a stronger large language model (LLM) as the teacher. The teacher uses the task dataset to create a list of skills needed for the task, and then labels each data point with its required skills (Didolkar et al., 2024). By monitoring the student's answers, the teacher creates a Missing-Skill-Profile for the student, tracking how often it failed to apply each skill in its responses. We use this idea to build a modified training set in one of two ways. In STAT-Sel, the teacher uses an existing set of training examples but adaptively reweights them according to the Missing-Skill-Profile. In STAT-Syn, the teacher synthesizes additional examples involving missing skills. Across extensive experiments on Llama and Qwen models, our methods yield improvements of up to 7.5% on MATH, whereas SFT provides only limited gains. Furthermore, STAT enhances performance on out-of-distribution benchmarks (e.g., AIME24/25, AMC23, etc.) by an average of 4.6%. Crucially, we find that STAT is complementary to RL via GRPO (Shao et al., 2024): after the model is improved using STAT to address skill gaps, GRPO continues to add further gains. We conclude that skill-targeted adaptive training should broadly improve current training pipelines. Our code is available at: https://github.com/princeton-pli/STAT.
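As a concrete illustration, the Missing-Skill-Profile described above can be sketched as a count of how often each teacher-labeled skill appears among the student's incorrect answers. This is a minimal sketch with hypothetical data layout and function names, not the authors' implementation:

```python
from collections import Counter

def missing_skill_profile(examples, student_correct):
    """Count how often each teacher-labeled skill occurs among the
    examples the student answered incorrectly.

    examples: list of dicts, each with a "skills" list of
              teacher-assigned skill labels (hypothetical schema)
    student_correct: list of bools, one per example
    """
    profile = Counter()
    for ex, correct in zip(examples, student_correct):
        if not correct:
            for skill in ex["skills"]:
                profile[skill] += 1
    return profile

# Toy example with made-up skill labels.
examples = [
    {"skills": ["algebra", "factoring"]},
    {"skills": ["geometry"]},
    {"skills": ["algebra", "modular_arithmetic"]},
]
student_correct = [False, True, False]
print(missing_skill_profile(examples, student_correct))
# Counter({'algebra': 2, 'factoring': 1, 'modular_arithmetic': 1})
```

In practice the paper's teacher model judges whether each skill was applied in the student's response, which is a richer signal than the simple right/wrong flag used here.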
Community
We introduce a new training paradigm, Skill-Targeted Adaptive Training (STAT), which offers a principled path to overcoming SFT saturation and advancing generalization in LLMs.
1️⃣ Current Bottleneck
Supervised fine-tuning (SFT) often plateaus when models are trained on data similar to their pretraining distribution — a phenomenon of saturation seen on benchmarks like MATH.
2️⃣ Our Approach: STAT
We introduce Skill-Targeted Adaptive Training (STAT), a new fine-tuning paradigm that leverages the metacognition of a stronger LLM as a teacher. The teacher identifies the skills required for a task, tracks where the student model struggles, and builds a Missing-Skill Profile.
• STAT-Sel adaptively reweights existing examples based on missing skills.
• STAT-Syn synthesizes new examples targeting those gaps.
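The reweighting step of STAT-Sel can be sketched as follows, assuming each training example carries teacher-assigned skill labels and a Missing-Skill-Profile maps skills to failure counts. The scoring rule (smoothed sum of missing-skill counts, normalized into sampling probabilities) is an illustrative assumption, not the paper's exact formula:

```python
def stat_sel_weights(examples, profile, smoothing=1.0):
    """Assign each training example a sampling probability proportional
    to how often the student failed on its required skills.

    examples: list of dicts, each with a "skills" list (hypothetical schema)
    profile: dict mapping skill -> number of student failures
    smoothing: keeps examples with no missing skills sampleable
    """
    scores = []
    for ex in examples:
        score = smoothing + sum(profile.get(s, 0) for s in ex["skills"])
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]

# Toy example: the student failed "algebra" three times.
examples = [
    {"skills": ["algebra"]},
    {"skills": ["geometry"]},
]
profile = {"algebra": 3}
print(stat_sel_weights(examples, profile))  # [0.8, 0.2]
```

Examples exercising frequently missed skills get sampled more often during fine-tuning, concentrating gradient updates on the student's weakest skills.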
3️⃣ Results
Across Llama and Qwen models, STAT achieves:
• up to +7.5% on MATH (vs. minimal gains from vanilla SFT)
• +4.6% average boost on out-of-distribution benchmarks (AIME24/25, AMC23, etc.)
Moreover, STAT is complementary to reinforcement learning (e.g., GRPO): addressing skill gaps before RL further amplifies downstream gains.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning (2025)
- Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning (2025)
- PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning (2025)
- Representation-Based Exploration for Language Models: From Test-Time to Post-Training (2025)
- Aligning Large Language Models via Fully Self-Synthetic Data (2025)
- WST: Weak-to-Strong Knowledge Transfer via Reinforcement Learning (2025)
- Jointly Reinforcing Diversity and Quality in Language Model Generations (2025)