Abstract
Vision language models exhibit strong biases in counting and identification tasks, a failure mode that persists even with additional instructions or context.
Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but can also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a fourth stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy on counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains, from animals, logos, chess, and board games to optical illusions and patterned grids. Inserting text (e.g., "Adidas") naming the subject into the counterfactual image further decreases VLM accuracy. The biases in VLMs are so strong that instructing them to double-check their results or to rely exclusively on image details improves counting accuracy by only +2 points, on average. Our work presents an interesting failure mode in VLMs and an automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.
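To make the evaluation setup concrete, below is a minimal sketch of how one might pose a counting question about a counterfactual image to a VLM. It uses the OpenAI Python client as an assumption; the model name and image path are placeholders, and this is not the authors' released framework (see vlmsarebiased.github.io for the actual code and data).

```python
# Minimal sketch (assumption: OpenAI Python client; not the authors' framework):
# ask a VLM to count stripes in a counterfactual logo image.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def count_stripes(image_path: str, prompt: str) -> str:
    """Send one counting question about a local image and return the raw answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Hypothetical counterfactual image: an Adidas-like logo with 4 stripes.
    answer = count_stripes(
        "adidas_4_stripes.png",
        "Count the stripes in this logo. Answer with the number only, "
        "based solely on what is visible in the image.",
    )
    print(answer)
```

Even with the explicit instruction to rely only on what is visible, the paper reports that such prompts raise counting accuracy by only about 2 points on average.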
Community
2505.23941
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models (2025)
- Vision language models are unreliable at trivial spatial cognition (2025)
- VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models (2025)
- MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly (2025)
- Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models (2025)
- MORALISE: A Structured Benchmark for Moral Alignment in Visual Language Models (2025)
- PEFT A2Z: Parameter-Efficient Fine-Tuning Survey for Large Language and Vision Models (2025)