arxiv:2504.21117

Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts

Published on Apr 29
· Submitted by gowitheflow on May 5
#3 Paper of the day
Abstract

Evaluating natural language generation (NLG) systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardisation, and demographic biases, limiting reproducibility. LLM-based evaluation offers a scalable alternative but is highly sensitive to prompt design, where small variations can lead to significant discrepancies. In this work, we propose an inversion learning method that learns effective reverse mappings from model outputs back to their input instructions, enabling the automatic generation of highly effective, model-specific evaluation prompts. Our method requires only a single evaluation sample and eliminates the need for time-consuming manual prompt engineering, thereby improving both efficiency and robustness. Our work contributes toward a new direction for more robust and efficient LLM-based evaluation.

Community

Paper submitter

Although using LLM-as-a-judge has become standard practice, crafting evaluation prompts has mostly relied on manual writing or LLM generation. We propose a framework for generating effective NLG evaluation prompts through inversion modeling. By training an inverse model specific to each "forward model", we can recover a prompt that works best for that forward model by giving the inverse model a one-shot example: a text to be evaluated together with the score we want it to receive. We show that this approach works well and generalizes to larger-scale evaluation, substantially outperforming both human-crafted prompts and prompts generated by forward models.

Excited to share our Inverse-Qwen model here!
https://huggingface.co/kou199024/Inverse-Qwen2.5-7B-BlackBox.
More inverse models on the way.
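As a rough illustration of the workflow described above, the sketch below builds the kind of one-shot input (a sample text plus its desired score) that an inverse model would invert into a forward-model-specific evaluation prompt. The function name, template wording, and score scale are all hypothetical assumptions, not the paper's exact format.

```python
def build_inversion_input(sample_text: str, target_score: int, scale: str = "1-5") -> str:
    """Format a one-shot example for an inverse model.

    The inverse model is trained to map a model output (here: a text and the
    score we want it to receive) back to the instruction that would elicit it.
    This template is a hypothetical illustration of that input.
    """
    return (
        f"Output: the text below should be rated {target_score} on a {scale} scale.\n"
        f"Text: {sample_text}\n"
        "Instruction:"
    )

# Build the one-shot example; in practice this string would be fed to an
# inverse model such as kou199024/Inverse-Qwen2.5-7B-BlackBox (e.g. via
# transformers' generate()), and the decoded continuation would be the
# recovered evaluation prompt for the forward model.
example = build_inversion_input("The movie was thrilling from start to finish.", 5)
```

The decoding step is omitted here since it requires downloading the 7B checkpoint; only the input-construction step is shown.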

