Annotation-Efficient Universal Honesty Alignment
Abstract
EliCal, a two-stage framework combining self-consistency supervision and minimal correctness annotations, achieves near-optimal honesty alignment in large language models with limited annotation effort.
Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or on training-based calibration with correctness annotations. While effective, training-based calibration requires costly, large-scale labeling to achieve universal honesty alignment. To enable annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision and then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
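The two-stage recipe described in the abstract (cheap self-consistency targets first, a small correctness-labeled set second) can be made concrete with a short sketch. The code below is illustrative only, not the authors' released implementation: `sample_answers` is a hypothetical stand-in for querying the model, and the Platt-style logistic fit is just one plausible choice for the final calibration step.

```python
"""Illustrative sketch of an elicit-then-calibrate pipeline (EliCal-style).

Hypothetical stand-ins, not the paper's code: `sample_answers` abstracts
model sampling, and Platt-style scaling is one possible calibration choice.
"""
from collections import Counter
import numpy as np

def self_consistency(answers):
    """Cheap Stage-1 supervision: agreement rate with the majority answer."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

# --- Stage 1: elicitation ---------------------------------------------------
# For each unlabeled question, sample k answers and use the self-consistency
# score as an inexpensive training target for a confidence predictor.
def stage1_targets(questions, sample_answers, k=8):
    return {q: self_consistency(sample_answers(q, k)) for q in questions}

# --- Stage 2: calibration ---------------------------------------------------
# A small correctness-annotated set (e.g., ~1k examples) maps the elicited
# confidence c onto a calibrated probability of correctness via a
# one-dimensional logistic fit.
def fit_platt(conf, correct, lr=0.1, steps=2000):
    a, b = 1.0, 0.0
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * conf + b)))
        grad = p - correct                      # d(BCE)/d(logit)
        a -= lr * np.mean(grad * conf)
        b -= lr * np.mean(grad)
    return lambda c: 1.0 / (1.0 + np.exp(-(a * c + b)))

if __name__ == "__main__":
    # Stage-1 target for one question's sampled answers.
    print(self_consistency(["Paris", "paris", "Lyon", "Paris"]))  # 0.75

    # Simulated elicited confidences plus a small correctness-labeled set.
    rng = np.random.default_rng(0)
    conf = rng.uniform(0, 1, 1000)
    correct = (rng.uniform(0, 1, 1000) < 0.3 + 0.6 * conf).astype(int)
    calibrate = fit_platt(conf, correct)
    print("calibrated P(correct | conf=0.9) ≈", round(calibrate(0.9), 3))
```

The key point the sketch tries to convey is that Stage 1 needs no human labels at all, so the expensive correctness annotations are spent only on the low-dimensional calibration step.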
Community
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- GrACE: A Generative Approach to Better Confidence Elicitation in Large Language Models (2025)
- Can Large Language Models Express Uncertainty Like Human? (2025)
- Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief (2025)
- ConfTuner: Training Large Language Models to Express Their Confidence Verbally (2025)
- Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning (2025)
- LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization (2025)
- The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations (2025)