Beyond the Surface: Measuring Self-Preference in LLM Judgments
Abstract
The DBG score measures self-preference bias in large language models by comparing a judge's scores for its own responses against gold judgments that serve as proxies for true response quality, thereby removing the confounding effect of quality on bias measurement.
Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.
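The abstract defines the DBG score as the gap between the judge model's scores for its own responses and the corresponding gold judgments. Below is a minimal sketch of that computation, assuming the judge's scores and the gold judgments are numeric values on the same scale; the function and variable names are hypothetical illustrations, not taken from the paper's released code.

```python
from statistics import mean

def dbg_score(judge_scores_on_own_responses, gold_judgments):
    """Average gap between the judge's scores for its own responses and the
    gold judgments of those same responses. Because the gold judgments stand
    in for true quality, a positive gap suggests self-preference beyond what
    response quality alone would explain."""
    assert len(judge_scores_on_own_responses) == len(gold_judgments)
    gaps = [s - g for s, g in zip(judge_scores_on_own_responses, gold_judgments)]
    return mean(gaps)

# Toy example: the judge rates its own answers slightly above the gold judgments.
print(dbg_score([8.5, 9.0, 7.0], [8.0, 8.5, 7.0]))  # -> 0.333...
```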
Community
In this work, we propose the DBG score to reliably measure the self-preference bias of LLM judges. Using this metric, we conduct comprehensive experiments to evaluate the self-preference bias of LLMs across different versions, sizes, and reasoning abilities. In addition, we explore two factors that influence and help mitigate self-preference bias: response text style and the post-training data of the judges. Finally, we investigate the underlying mechanisms of this bias from an attention-level perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.
The following papers were recommended by the Semantic Scholar API
- Judging with Many Minds: Do More Perspectives Mean Less Prejudice? (2025)
- Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge (2025)
- CHARM: Calibrating Reward Models With Chatbot Arena Scores (2025)
- Assessing Judging Bias in Large Reasoning Models: An Empirical Study (2025)
- Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation (2025)
- J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning (2025)
- Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments (2025)