Vision Language Models are Biased
Paper • 2505.23941 • Published • 20
Score image-text similarity using CLIP or SigLIP models
Segment images based on text prompts
Identify and mask objects in images using text prompts
Generate correspondences between images