arxiv:2506.02592

Beyond the Surface: Measuring Self-Preference in LLM Judgments

Published on Jun 3
· Submitted by JaxChen on Jun 5
Abstract

The DBG score measures self-preference bias in large language models as the difference between the score a judge model assigns to its own response and a gold judgment of that response; by treating gold judgments as proxies for true response quality, it removes response quality as a confound.

AI-generated summary

Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.
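The DBG computation described in the summary is simple enough to sketch in code. The following is a minimal illustration, assuming numeric quality ratings and a plain mean as the aggregation step; the function and variable names are hypothetical and not taken from the paper's released repository.

```python
# Sketch of a DBG-style computation, based on the abstract's description:
# self-preference bias is estimated as the gap between the score a judge model
# assigns to its own responses and gold judgments of those same responses.
# The names and the aggregation (a simple mean) are illustrative assumptions.

from statistics import mean

def dbg_score(judge_scores, gold_scores):
    """Average difference between a judge's self-assigned scores and gold judgments.

    judge_scores: scores the judge model gave to its own responses.
    gold_scores: gold judgments (proxies for true quality) of the same responses.
    A positive value suggests the judge rates its own responses above their
    gold-estimated quality, i.e. self-preference bias.
    """
    assert len(judge_scores) == len(gold_scores)
    return mean(j - g for j, g in zip(judge_scores, gold_scores))

# Hypothetical example: the judge rates its own responses above the gold judgments.
judge = [8, 9, 7, 8]
gold = [7, 8, 7, 6]
print(dbg_score(judge, gold))  # 1.0 -> self-preference under this sketch
```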

Community

Paper author · Paper submitter (edited 1 day ago)

In this work, we propose the DBG score to reliably measure the self-preference bias of LLM judges. Using this metric, we conduct comprehensive experiments to evaluate the self-preference bias of LLMs across different versions, sizes, and reasoning abilities. In addition, we explore two factors that influence and help mitigate self-preference bias: response text style and the post-training data of the judges. Finally, we investigate the underlying mechanisms of this bias from an attention-level perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.


