A Single Character can Make or Break Your LLM Evals
Abstract
The choice of delimiter used to format in-context examples significantly impacts the performance of large language models across model families and tasks.
Common large language model (LLM) evaluations rely on demonstration examples to steer models' responses toward the desired style. While the number of examples used has been studied and standardized, the choice of how to format those examples is far less investigated. In evaluation protocols and real-world usage, users face the choice of how to separate in-context examples: a comma? A new line? A semicolon? A hashtag? Surprisingly, we find this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU, for example, can vary by ±23% depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by modifying only the single character separating examples. We find that this brittleness persists across topics and model families and does not improve with scale. By probing attention head scores, we find that well-performing delimiters steer attention toward key tokens in the input. Finally, we explore methods to improve LLMs' robustness to the choice of delimiter: specifying the selected delimiter in the prompt boosts robustness, and we offer practical recommendations for the best-performing delimiters to select.
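To make the setting concrete, here is a minimal sketch of how a few-shot prompt is assembled and how the delimiter choice enters. The helper function and example demonstrations are hypothetical illustrations, not the paper's actual evaluation code.

```python
# Minimal sketch (hypothetical helper, not the paper's code): the single
# character separating in-context examples changes the final prompt string.

DEMOS = [
    ("Q: What is the capital of France?", "A: Paris"),
    ("Q: What is 2 + 2?", "A: 4"),
]

def build_prompt(demos, query, delimiter):
    """Join demonstration examples with the chosen delimiter, then append the query."""
    blocks = [f"{q}\n{a}" for q, a in demos]
    return delimiter.join(blocks) + delimiter + query

query = "Q: What is the largest planet in the Solar System?\nA:"

# The same demonstrations, separated by a newline, a comma, a semicolon, or a hashtag.
for delim in ["\n", ", ", "; ", " # "]:
    prompt = build_prompt(DEMOS, query, delim)
    print(repr(delim), "->", repr(prompt[:60]), "...")
    # Each prompt would then be sent to the model; the paper reports that this
    # seemingly minor choice can shift MMLU accuracy by up to ±23%.
```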
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs (2025)
- Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs (2025)
- On Robustness and Reliability of Benchmark-Based Evaluation of LLMs (2025)
- Promptception: How Sensitive Are Large Multimodal Models to Prompts? (2025)
- Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates (2025)
- BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses (2025)
- Benchmarking and Improving LLM Robustness for Personalized Generation (2025)