Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation
Abstract
A novel candidate annotation paradigm using a teacher-student framework improves data quality for downstream applications by encouraging large language models to output multiple labels when uncertain.
Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs of downstream applications. However, existing methods mostly adopt an aggressive strategy, prompting the LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising data quality for downstream applications. Motivated by ambiguity aversion in human behavior, we propose a novel candidate annotation paradigm in which large language models are encouraged to output all possible labels when they are uncertain. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework, CanDist, that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at https://github.com/MingxuanXia/CanDist.
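To make the candidate annotation step concrete, here is a minimal sketch of how a teacher LLM could be prompted to return a set of candidate labels instead of a single forced choice. It assumes an OpenAI-style chat API; the prompt wording, model name, and JSON output convention are illustrative assumptions, not the paper's exact setup.

```python
import json
from openai import OpenAI  # assumes the `openai` Python client (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt: ask for ALL plausible labels rather than forcing one.
CANDIDATE_PROMPT = """You are annotating text for classification.
Labels: {labels}
Text: {text}
Return a JSON list containing every label that could plausibly apply.
If you are confident, return a single label; if uncertain, return all
labels you consider possible."""

def candidate_annotate(text: str, labels: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """Query the teacher LLM for a candidate label set (may contain more than one label)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": CANDIDATE_PROMPT.format(labels=", ".join(labels), text=text)}],
        temperature=0,
    )
    try:
        candidates = json.loads(response.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        return labels  # unparsable output: fall back to full ambiguity
    # Keep only valid labels; fall back to the full label set if none survive.
    return [c for c in candidates if c in labels] or labels
```

The fallback to the full label set is one possible design choice: an unparsable or invalid answer is treated as maximal uncertainty rather than discarded.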
Community
This work studies LLM-driven data annotation through a novel teacher-student framework, CanDist, which first prompts the teacher LLM to generate candidate labels and then distills a student SLM to identify the true labels. We show that candidate annotations exhibit better statistical properties and theoretically justify that distilling from the LLM's candidate annotations is more noise-tolerant. Empirically, CanDist outperforms various LLM- and SLM-based methods. We hope our work inspires future research on exploiting candidate annotations from weak annotators.
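As a rough illustration of the distillation side, the sketch below trains the student SLM against the teacher's candidate sets with a marginal-likelihood (partial-label style) objective, i.e., maximizing the probability mass the student assigns to the candidate set as a whole. CanDist's actual distillation objective may differ; this is an assumed, common baseline for learning from candidate labels.

```python
import torch
import torch.nn.functional as F

def candidate_distill_loss(logits: torch.Tensor, candidate_mask: torch.Tensor) -> torch.Tensor:
    """
    Marginal-likelihood loss over LLM candidate sets (partial-label style).

    logits:          (batch, num_classes) student SLM outputs
    candidate_mask:  (batch, num_classes) binary mask, 1 where the teacher
                     LLM included the class in its candidate set
                     (assumed non-empty for every sample)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # -log sum_{y in S} p(y | x): mask out non-candidates before logsumexp
    masked = log_probs.masked_fill(candidate_mask == 0, float("-inf"))
    return -torch.logsumexp(masked, dim=-1).mean()

# Example: 3 classes; the teacher gave {0, 2} for sample 0 and {1} for sample 1.
logits = torch.randn(2, 3)
mask = torch.tensor([[1, 0, 1], [0, 1, 0]])
loss = candidate_distill_loss(logits, mask)
```

Unambiguous samples (a single candidate) reduce to standard cross-entropy, so the loss smoothly interpolates between supervised training and learning under ambiguity.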
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge (2025)
- Learning to Select In-Context Demonstration Preferred by Large Language Model (2025)
- LLMs as Data Annotators: How Close Are We to Human Performance (2025)
- BiasFilter: An Inference-Time Debiasing Framework for Large Language Models (2025)
- Multi-MLLM Knowledge Distillation for Out-of-Context News Detection (2025)
- S2LPP: Small-to-Large Prompt Prediction across LLMs (2025)
- Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework (2025)